Message from the Program Co-Chairs 


Dear LISA ’11 Attendee, 


There are two kinds of LISA attendees: those who read this letter at the conference and those who read it after 
they’ve returned home. To the first group, get ready for six days of brain-filling, technology-packed, geek-centric 
tutorials, speakers, papers, and more! To those that are reading this after the conference, we ask, “What’s it like 
living in the future? How was the conference? What cool tips and tools did you take home with you to make your 
job easier?” 


Being a sysadmin is kind of like living in the future. You work with technology every day that would make Buck 
Rogers jealous. Most of our friends are jealous, too. When LISA started 25 years ago, a “large site” had 10 comput- 
ers, each the size of a dishwasher, with a few gigabytes of combined storage. Today our cell phones have 32GB of 
“compact flash,” which is often more than the NFS quota we give our users. 


Attending LISA is kind of like spending a week living in the future. We learn technologies that are cutting-edge— 
little known now, but next year everyone will be talking about them. When we return from LISA we sound like 
time travelers visiting from the future talking about new and futuristic stuff. LISA makes us look good. 


LISA rarely has a cohesive conference theme, but this year we thought it was important to highlight DevOps, as it 
is a significant cultural change. Although DevOps is often thought of as “something big Web sites do,” the lessons 
learned transfer well to enterprise computing. 


LISA has always been assembled using the sweat of many dedicated volunteers. It takes a lot of effort to put a 
conference like this together, and this year is no different. Most prominent are the Invited Talks committee (Æleen 
Frisch and Kent Skaar) and the Program Committee (Narayan Desai, Andrew Hume, Duncan Hutty, Dinah 
McNutt, Tim Nelson, Mario Obejas, Mark Roth, Carolyn Rowland, Federico D. Sacerdoti, Marc Stavely, Nicole 
Forsgren Velasquez, Avleen Vig, and David Williamson), but also important are the Workshops Coordinator (Cory 
Lueninghoener), the Guru Is In Coordinator (Chris St. Pierre), the Poster Session Coordinator (Matt Disney), and 
the Work-in-Progress Reports Coordinator (William Bilancio). We couldn’t have done it without every one of them. 
Of course, nothing would happen without the leadership of the USENIX staff. We are indebted to you all! 


Of the 63 papers submitted, we accepted 28. These papers represent the best “deep thought” research, as well as 
Practice and Experience Reports that tell the stories from people “in the trenches.” We encourage you to read them 
all. However, the power of LISA is the personal interaction: introduce yourself to the attendees standing in line 
near you, strike up a conversation with the person sitting next to you. And remember to have fun! 


Sincerely, 

Thomas A. Limoncelli, Google, Inc. 
Doug Hughes, D. E. Shaw Research, LLC 
Program Co-Chairs 
Staging Package Deployment via Repository Management 


Chris St. Pierre - stpierreca@ornl.gov 
Matt Hermanson - mjhermanson@ornl.gov 
National Center for Computational Sciences 

Oak Ridge National Laboratory 
Oak Ridge, TN, USA* 


Abstract 


This paper describes an approach for managing package versions and updates in a homogenous manner 
across a heterogenous environment by intensively managing a set of software repositories rather than by 
managing the clients. This entails maintaining multiple local mirrors, each of which is aimed at a different 
class of client: One is directly synchronized from the upstream repositories, while others are maintained 
from that repository according to various policies that specify which packages are to be automatically 
pulled from upstream (and therefore automatically installed without any local vetting) and which are to 
be considered more carefully (likely installed in a testing environment, for instance) before they are 
deployed widely. 


Background 


It is important to understand some points about our 
environment, as they provide important constraints 
to our solution. 

We are lucky enough to run a fairly homoge- 
nous set of operating systems consisting primarily of 
Red Hat Enterprise Linux and CentOS servers, with 
fair numbers of Fedora and SuSE outliers. In short, 
we are dealing entirely with RPM-based packaging, 
and with operating systems that are capable of using 
yum [12]. As yum is the default package manage- 
ment utility for the majority of our servers, we opted 
to use yum rather than try to switch to another pack- 
age management utility. 

For configuration management, we chose to use 
Bcfg2 [3] for reasons wholly unrelated to package and 
software management. Bcfg2 is a Python and XML- 
based configuration management engine that “helps 
system administrators produce a consistent, repro- 
ducible, and verifiable description of their environ- 
ment” [3]. It is in particular the focus on repro- 
ducibility and verification that forced us to consider 
updating and patching anew. 

In order to guarantee that a given configuration (where 
a “configuration” is defined as the set of paths, files, 
packages, and so forth, that describes a single system) 
is fully replicable, Bcfg2 ensures that every package 
specified for a system is the latest available from 
that system’s software repositories [8]. (As 
will be noted, this can be overridden by specifying 
an explicit package version.) This grants the system 
administrator two important abilities: to provision 
identical machines that will remain identical; and to 
reprovision machines to the exact same state they 
were previously in. But it also makes it unreasonable 
to simply use the vendor’s software repositories (or 
other upstream repositories), since all updates will be 
installed immediately without any vetting. The same 
problem presents itself even with a local mirror. 


Bcfg2 can also use “the client’s response to the 
specification ... to assess the completeness of the 
specification” [3]. For this to happen, the Bcfg2 
server must be able to understand what a “com- 
plete” specification entails, and so the server does 
not entirely delegate package installation to the Bcfg2 
client. Instead, it performs package dependency res- 
olution on the server rather than allowing the client 
to set its own configuration. This necessitates ensuring 
that the Bcfg2 Packages plugin uses the same yum 
configuration as the clients; Bcfg2 has support for 
making this rather simple [8], but the Packages plugin 
does not support the full range of yum functionality, 
so certain functions, like the “versionlock” plugin and 
even package excludes, are not available. Due to the 
architecture of Bcfg2 (architecture designed to 
guarantee replicability and verification of server 
configurations) it is not feasible or, in most cases, 
possible to do client-based package and repository 
management. This became critically important in 
selecting a solution. 

*This paper has been authored by contractors of the U.S. Government under Contract No. DE-AC05-00OR22725. 


Other Solutions 


There are a vast number of potential solutions to this 
problem that would seem to be low-hanging fruit — 
far simpler to implement, at least initially, than our 
ultimate solution — but that would not work, for var- 
ious reasons. 


Yum Excludes 


A core yum feature is the ability to exclude certain 
packages from updates or installation [13]. At first, 
this would seem to be a solution to the problem of 
package versioning: simply install the package version 
you want, and then exclude it from further updates. 
But this has several issues that made it unsuitable 
for our use (or, we believe, this use case in general): 


• It does not (and cannot) guarantee a specific 
version. Using excludes to set a version depends 
on that version being installed (manually) prior 
to adding the package to the exclude list. 


• There is no guarantee that the package is still in 
the repository. Many mainstream repositories¹ 
do not retain older versions in the same repos- 
itory as current packages. Consequently, when 
reinstalling a machine where yum excludes have 
been used to set package versions (or when at- 
tempting to duplicate such a machine), there is 
no guarantee that the package version expected 
will even be available. 


• In order to use yum excludes to control package 
versions, a very specific order of events must oc- 
cur: first, the machine must be installed with- 
out the target package included (as Kickstart, 
the RHEL installation tool, does not support 
installing a specific version of a package [1]); 
next, the correct package version must be installed; 
and finally, the package must be added 
to the exclude list. If this happens out of order, 
then the wrong version of the package might be 
installed, or the package might not be installed 
at all. 


• Supplying a permitted update to a package is 
even more difficult, as it involves removing the 
package exclusion, updating to the correct ver- 
sion, and then restoring the exclusion. A config- 
uration management system would have to have 
tremendously granular control over the order in 
which actions are performed to accomplish this 
delicate goal. 


• As discussed earlier, Bcfg2 performs dependency 
resolution on the server side in order to 
provide a guarantee that a client’s configura- 
tion is fully specified. By using yum excludes — 
which cannot be configured in Bcfg2’s internal 
dependency resolver — the relationship between 
the client and the server is broken, and Bcfg2 
will in perpetuity claim that the client is out of 
sync with the server, thus reducing the useful- 
ness of the Bcfg2 reporting tools. 


While yum excludes appear at first to be a viable 
option, their use to set package versions is neither 
replicable nor consistent, and cannot be trivially automated. 


Specifying Versions in Bcfg2 


Bcfg2 can pin specific versions of packages in the 
specification, e.g.: 


<BoundPackage name="glibc" type="yum"> 
  <Instance version="2.13" release="1" 
            arch="i686"/> 
  <Instance version="2.13" release="1" 
            arch="x86_64"/> 
</BoundPackage> 


This is obviously quite verbose (more so because 
the example uses a multi-arch package), and as a re- 
sult of its verbosity it is also error-prone. Having 
to recopy the version, release, and architecture of a 
package — separately — is not always a trivial process, 
and the relatively few constraints of version and re- 
lease strings makes it less so. For instance, given the 
package: 


iomemory-vsl-2.6.35.12-88.fc14.x86_64-2.3.0.281-1.0.fc14.x86_64.rpm 


The package name is “iomemory-vsl-2.6.35.12-88.fc14.x86_64” 
(which refers to the specific kernel for which it was 
built), the version is “2.3.0.281” and the release is 
“1.0.fc14”.² This can be clarified through 
use of the --queryformat option to rpm, but the fact 
that more advanced RPM commands are necessary 
makes it clear that this approach is untenable in gen- 
eral. Even more worrisome is the package epoch, a 
sort of “super-version,” which RPM cleverly hides by 
default, but could cause a newer package to be in- 
stalled if it was not specified properly. 
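
To illustrate why recopying these fields by hand is error-prone, the following minimal Python sketch splits an RPM file name into its parts. It assumes the conventional name-version-release.arch.rpm layout; the helper name split_rpm_filename is hypothetical and is not part of rpm, yum, or Bcfg2. Note that the epoch cannot be recovered this way at all, since it does not appear in the file name.

```python
def split_rpm_filename(filename):
    """Split 'name-version-release.arch.rpm' into its four fields.

    Hypothetical helper for illustration only: it assumes the
    conventional RPM file name layout and knows nothing about epochs.
    """
    if filename.endswith(".rpm"):
        filename = filename[: -len(".rpm")]
    # The architecture is everything after the last dot.
    rest, arch = filename.rsplit(".", 1)
    # Version and release are the last two hyphen-separated fields;
    # the name keeps any hyphens and dots of its own.
    name, version, release = rest.rsplit("-", 2)
    return name, version, release, arch


if __name__ == "__main__":
    print(split_rpm_filename(
        "iomemory-vsl-2.6.35.12-88.fc14.x86_64-2.3.0.281-1.0.fc14.x86_64.rpm"
    ))
```

Splitting from the right is what keeps the kernel version embedded in the package name from being mistaken for the package's own version.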

Maintenance is also tedious, as it involves end- 
lessly updating verbose version strings; recall that a 
given version is just shorthand for what we actually 
care about — that a package works. 

This approach also does not prevent the use of yum 
on a system to update it beyond the appropriate point. 
The only thing keeping a package at the chosen version 
is Bcfg2’s own self-restraint; if an admin on a machine 
lacks that same self-restraint, then he or she could 
easily update a package that was not to be updated, 
whereupon Bcfg2 would try to downgrade it. 

Finally, this approach presents specific difficulties 
for us, as our adoption of Bcfg2 is far from com- 
plete; large swaths of the center still use Cfengine 2, 
and some machines — particularly compute and stor- 
age platforms — operate in a diskless manner and do 
not use configuration management tools in a traditional 
manner. They depend entirely on their images 
for package versions, so specifying versions in Bcfg2 
would not help. 

To clarify, using Bcfg2 forced us to reconsider this 
problem, and any solution must be capable of work- 
ing with Bcfg2, but it cannot be assumed that the 
solution may leverage Bcfg2. 


Yum versionlock 


Using yum’s own version locking system would ap- 
pear to improve upon pegging versions in Bcfg2: 
it works on all systems, regardless of whether or 
not they use Bcfg2; and a shortcut command, 
yum versionlock <package-name>, is provided to make 
the process of maintaining versions less error-prone.³ 

It also solves many of the problems of yum ex- 
cludes, but suffers from a critical flaw in that ap- 
proach: by setting package versions on the client, 
the relationship between the Bcfg2 client and server 
would be broken. 

Combinations of these three approaches merely 
exhibit combinations of their flaws. For instance, 
the promising combination of yum’s versionlock plugin 
and specifying the version in Bcfg2 would ensure 
that the Bcfg2 client and server were of a mind about 
package versions, and would work on non-Bcfg2 ma- 
chines; however, it would forfeit versionlock’s ease of 
use and require the administrator to once again man- 
ually copy package versions. 


Spacewalk 


Spacewalk was the first full-featured solution we 
looked at that aims to replace the mirroring portion 
of this relationship; all of the other potential solu- 
tions listed thus far have attempted to work with a 
“dumb” mirror and use yum features to work around 
the problem we have described. Spacewalk is a local 
mirror system that “manages software content up- 
dates for Red Hat derived [sic] distributions” [10]; it 
is a tremendously full-featured system, with support 
for custom “channels,” collections of packages 
assembled on an ad hoc basis. 

Unfortunately, Spacewalk was a non-starter for us 
for the same reason that it has failed to gain much 
traction in the community at large: of the two ver- 
sions of Spacewalk, only the Oracle version actually 
implements all of the features; the PostgreSQL ver- 
sion is deeply underfeatured, even after several years 
of work by the Spacewalk team to port all of the Or- 
acle stored procedures. 

As it turns out, Red Hat has a successor in 
mind for Spacewalk and Satellite: CloudForms [14]. 
The content management portion of CloudForms — 
roughly corresponding to the mirror and repository 
management functionality of Spacewalk — is Pulp. 


A solution: Pulp 


Pulp is a tool “for managing software repositories 
and their associated content, such as packages, er- 
rata, and distributions” [7]. It is, as noted, the spir- 
itual successor to Spacewalk, and so implements the 
vast majority of Spacewalk’s repository management 
features without the dependency on Oracle. 

Pulp’s usage model involves syncing multiple upstream 
repositories locally; these repositories can then be 
cloned, which uses hard links and so consumes almost 
no additional disk space. This allows us to sync a 
repository once, then duplicate it as many times as 
necessary to support multiple teams and multiple 
stability levels. The sync process supports filters, 
which allow us to blacklist or whitelist 
packages and thus exclude “impactful” packages from 
automatic updates. 
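
The reason a clone costs almost no disk space can be seen in a minimal Python sketch of hard-link cloning. This is an illustration of the general technique only, not Pulp's implementation; the function name and the directory layout are assumptions, and source and destination must live on the same filesystem.

```python
import os
import tempfile


def clone_with_hard_links(src_dir, dst_dir):
    """Recreate src_dir's tree under dst_dir, hard-linking every file.

    Hypothetical sketch: directories are created anew, but each file is
    a second name for the same on-disk data, so no content is copied.
    """
    for root, _dirs, files in os.walk(src_dir):
        rel = os.path.relpath(root, src_dir)
        target_root = os.path.normpath(os.path.join(dst_dir, rel))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            os.link(os.path.join(root, name), os.path.join(target_root, name))


def demo():
    """Clone a one-file 'live' repository and check that data is shared."""
    with tempfile.TemporaryDirectory() as tmp:
        live = os.path.join(tmp, "live")
        unstable = os.path.join(tmp, "unstable")
        os.makedirs(live)
        with open(os.path.join(live, "example.rpm"), "wb") as handle:
            handle.write(b"placeholder contents, not a real package")
        clone_with_hard_links(live, unstable)
        src = os.stat(os.path.join(live, "example.rpm"))
        dst = os.stat(os.path.join(unstable, "example.rpm"))
        # Same inode and a link count of two: one file, two names.
        return src.st_ino == dst.st_ino and dst.st_nlink == 2


if __name__ == "__main__":
    print(demo())
```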

Pulp also supports manually adding packages to 
and removing packages from repositories, so we can 
later update a given package across all machines that 
use a repository with a single command. Adding and 
removing also tracks dependencies, so it’s not possi- 
ble to add a package to a repository without adding 
the dependencies necessary to install it.⁴ 





Workflow 


Pulp provides us with the framework to implement 
a solution to the problem outlined earlier, but, 
featureful as it is, it remains a fairly basic tool. 
Our workflow — enforced by the features Pulp pro- 
vides, by segregating repositories, by policy, and by 
a nascent in-house web interface — provides the bulk 
of the solution. Briefly, we segregate repositories by 
tier to test packages before site-wide roll-outs, and by 
team to ensure operational separation. Packages are 
automatically synced between tiers based on package 
filters, which blacklist certain packages that must be 
promoted manually. This ensures that most packages 
benefit from up to two weeks of community testing 
before being deployed site-wide, and packages that 
we have judged to be potentially more “impactful” 
benefit from more focused local testing as well. 


Tiered Repositories 


We maintain different repository sets for different 
“levels” of stability. We chose to maintain three tiers: 


live Synced daily from upstream repositories; not 
used on any machines, but maintained due to 
operational requirements within Pulp⁵ and for 
reference. 


unstable Synced daily from live, with the excep- 
tion of selected “impactful” packages (more 
about which shortly), which can be manually 
promoted from live. 


stable Synced daily from unstable, with the excep- 
tion of the same “impactful” packages, which 
can be manually promoted from unstable. 


This three-tiered approach guarantees that packages 
in stable are at least two days old, and “impactful” 
packages have been in testing by machines using the 
unstable branch. When a package is released from 
upstream and synced to public mirrors, it is pulled 
down into local repositories. From then on the package 
is under the control of Pulp. Initially, a package is 
considered unstable and is only deployed to those 
systems that look at the repositories in the unstable 
tier. After a period of time, the package is then 
promoted into the stable repositories, and thus to 
production machines. 

In order to ensure that packages in unstable receive 
ample testing before being promoted to stable, 
we divide machines between those two tiers as follows: 


• All internal test machines (that is, all machines 
whose sole purpose is to provide test and development 
platforms to customers within the group) use the 
unstable branch. Many of 
these machines are similar, if not identical, to 
production or external test machines. 


• Where multiple identical machines exist for a 
single purpose, whether in an active-active or 
active-passive configuration, exactly one ma- 
chine will use the unstable branch and the rest 
will use the stable branch. 


Additionally, we maintain separate sets of repos- 
itories, branched from live, for different teams or 
projects that require different patching policies ap- 
propriate to the needs of those teams or projects. 
Pulp has strong built-in ACLs that support these di- 
visions. 

In order to organize multiple tiers across multiple 
groups, we use a strict convention to specify the 
repository ID, which acts as the primary key across 
all repositories,⁶ namely: 


<team name>-<tier>-<os name>-<os version>-<arch>-<repo name> 


For example, 
infra-unstable-centos-6-x86_64-updates would 
denote the Infrastructure team’s unstable tier of the 
64-bit CentOS 6 “updates” repository. This allows us 
to tell at a glance the parent-child relationships be- 
tween repositories. 
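
A minimal Python sketch of how such an ID can be taken apart and how the parent tier follows from it. The names RepoId, parse_repo_id, and PARENT_TIER are hypothetical, and the sketch assumes that the team, tier, OS name, OS version, and architecture fields contain no hyphens of their own (the repository name may).

```python
from collections import namedtuple

# Field order follows the convention given in the text.
RepoId = namedtuple("RepoId", "team tier os_name os_version arch repo_name")

# Each tier is synced from the one it maps to; live comes from upstream.
PARENT_TIER = {"unstable": "live", "stable": "unstable"}


def parse_repo_id(repo_id):
    """Split a repository ID into its six fields (hypothetical helper)."""
    parts = repo_id.split("-", 5)  # at most six fields; the last keeps hyphens
    if len(parts) != 6:
        raise ValueError("not a six-field repository ID: %r" % repo_id)
    return RepoId(*parts)


def format_repo_id(repo):
    """Inverse of parse_repo_id."""
    return "-".join(repo)


if __name__ == "__main__":
    repo = parse_repo_id("infra-unstable-centos-6-x86_64-updates")
    print(repo)
    print("synced from tier:", PARENT_TIER[repo.tier])
```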


Sync Filters 


The syncs between the live and unstable and between 
unstable and stable tiers are mediated by filters.⁷ 
Filters are regular expression lists of packages to 
either blacklist from the sync, or whitelist in the 
sync; in our workflow, only blacklists are used. A 
package filtered from the sync may still remain in the 
repository; that is, if we specify “kernel(-.*)?” as a 
blacklist filter, that does not remove kernel packages 
from the repository, but rather refuses to sync new 
kernel packages from the repository’s parent. This 
is critical to our version-pegging system. 
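
The blacklist behavior described above can be sketched in a few lines of Python. This is a simplified model, not Pulp's code: each repository is reduced to a mapping from package name to a single version, and the sketch assumes that a filter must match the whole package name, which may differ from Pulp's actual matching rules.

```python
import re


def sync_with_blacklist(parent, child, blacklist):
    """Return the child's contents after a filtered sync from parent.

    parent, child: dicts mapping package name -> version string.
    blacklist: regular expressions; matching packages are never pulled
    from the parent, but copies already in the child are left alone.
    """
    patterns = [re.compile(expr) for expr in blacklist]
    result = dict(child)
    for name, version in parent.items():
        if any(pattern.fullmatch(name) for pattern in patterns):
            continue  # blacklisted: neither added nor removed
        result[name] = version
    return result


if __name__ == "__main__":
    live = {"kernel": "2.6.32-200", "kernel-devel": "2.6.32-200", "bash": "4.1-3"}
    unstable = {"kernel": "2.6.32-100", "bash": "4.1-2"}
    print(sync_with_blacklist(live, unstable, ["kernel(-.*)?"]))
```

In the example, bash is updated while the existing kernel stays pegged at its current version and kernel-devel is never pulled in.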

Given our needs, whitelist filters are unnecessary; 
our systems tend to fall into one of two types: 


• Systems where we generally want updates to 
be installed insofar as is reasonable, with some 
prudence about installing updates to “impact- 
ful” packages. 


• Systems where, due to vendor requirements, we 
must set all packages to a specific version. Most 
often this is in the form of a requirement for a 
minor release of RHEL,⁸ in which case there are 
no updates we wish to install on an automatic 
basis. (We may wish to update specific pack- 
ages to respond to security threats, but that 
happens with manual package promotion, not 
with a sync; this workflow gives us the flexibil- 
ity necessary to do so.) 


A package that may potentially cause issues when 
updated can be blacklisted on a per-team basis.⁹ 
Since the repositories are hierarchically tiered, a 
package that is blacklisted from the unstable tier 
will never make it to the stable tier. 


Manual Package Promotion and Removal 


The linchpin of this process is manually reviewing 
packages that have been blacklisted from the syncs 
and promoting them manually as necessary. For 
instance, if a filter for a set of repositories 
blacklisted “kernel(-.*)?” from the sync, then without 
manual promotion of new kernel packages no new kernel 
would ever be installed. 

To accomplish this, we use Pulp’s add package 
functionality, exposed via the REST API as a POST to 
/repositories/<id>/add_package/, via the Python client 
API as 
pulp.client.api.repository.RepositoryAPI.add_package(), 
and via the CLI as pulp-admin repo add_package. In the 
CLI implementation, add_package follows dependencies, 
so promoting a package will promote everything that 
package requires that is not already in the target 
repository. This helps ensure that each repository 
stays consistent even as we manipulate it to contain 
only a subset of upstream packages.¹⁰ 


Conversely, if a package is deployed and is later 
found to cause problems it can be removed from the 
tier and the previous version, if such is available in 
the repository, will be (re)installed. Bcfg2 will 
helpfully flag machines where a newer package is installed 
than is available in that machine’s repositories, and 
will try to downgrade packages appropriately. Pulp 
can be configured to retain old packages when it per- 
forms a sync; this is helpful for repositories like EPEL 
that remove old packages themselves, and guarantees 
that a configurable number of older package versions 
are available to fall back on. 
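
The dependency-following promotion described earlier in this section amounts to a transitive closure over package requirements. The Python sketch below is a simplified model under stated assumptions: dependencies are given as a plain mapping of package names, versions are ignored, and the function name is hypothetical rather than anything exposed by Pulp.

```python
def packages_to_promote(package, requires, target_repo):
    """Return the package plus every requirement missing from the target.

    requires: dict mapping a package name to the names it requires.
    target_repo: set of package names already in the target repository.
    """
    selected = {package}
    pending = list(requires.get(package, ()))
    while pending:
        dep = pending.pop()
        if dep in selected or dep in target_repo:
            continue  # already chosen, or already available in the target
        selected.add(dep)
        pending.extend(requires.get(dep, ()))
    return sorted(selected)


if __name__ == "__main__":
    requires = {
        "httpd": ["apr", "httpd-tools"],
        "httpd-tools": ["apr-util"],
        "apr-util": ["apr"],
    }
    print(packages_to_promote("httpd", requires, target_repo={"apr"}))
```

Removal works in the opposite direction, following the packages that require the one being removed, which is why a preview of the affected set is valuable before either operation.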

The remove package functionality is exposed via 
Pulp’s REST API as a POST to 
/repositories/<id>/delete_package/, via the Python 
client API as 
pulp.client.api.repository.RepositoryAPI.remove_package(), 
and via the CLI as pulp-admin repo remove_package. As 
with add_package, the CLI implementation follows 
dependencies and will try to remove packages that 
require the package being removed; this also helps 
ensure repository consistency. 

Optimally, security patches are applied 10 or 30 
days after the initial patch release [2]; this workflow 
allows us to follow these recommendations to some 
degree, promoting new packages to the unstable tier 
on an approximately weekly basis. Packages that 
have been in the unstable tier for at least a week 
are also promoted to the stable tier every week; in 
this we deviate from Beattie et al.’s recommendations 
somewhat, but we do so because the updates being 
promoted to stable have been vetted and tested by 
the machines using the unstable tier. 

This workflow also gives us something very important: 
the ability to install updates across all machines 
much sooner than the optimal 10- or 30-day period. 
High profile vulnerabilities require immediate action, 
even to the point of imperiling uptime, and by 
promoting a new package immediately to both stable 
and unstable tiers we can ensure that it is installed 
across all machines in our environment in a timely 
fashion. 


Selecting “impactful” packages 


Throughout this paper, we have referred to “impactful” 
packages (those for which we judged automatic updates 
to be particularly dangerous) as a driving factor. 
Were it not for our reluctance to automatically update 
all packages, we could have simply used an automatic 
update facility (yum-cron and yum-updatesd are both 
popular) and been done with it. 

We didn’t feel that was appropriate, though. For 
instance, installing a new kernel can be problematic 
— particularly in an environment with a wide variety 
of third-party kernel modules and other kernel-space 
modifications — and we wanted much closer control 
over that process. We flagged packages as “impact- 
ful” according to a simple set of criteria: 


• The kernel, and packages otherwise directly tied 
to kernel space (e.g., kernel modules and Dynamic 
Kernel Module Support (DKMS) packages); 


• Packages that provide significant, customer-facing 
services. On the Infrastructure team, 
this included packages like bind, httpd (and 
related modules), mysql, and so on. 


• Packages related to InfiniBand and Lustre [9]; 
as one of the world’s largest unclassified Lustre 
installations, it’s very important that the Lus- 
tre versions on our systems stay in lockstep with 
all other systems in the center. Parts of Lus- 
tre reside directly in kernel space, an additional 
consideration. 


The first two criteria provided around 20 packages 
to be excluded — a tiny fraction of the total packages 
installed across all of our machines. The vast major- 
ity of supporting packages continue to be automati- 
cally updated, albeit with a slight time delay for the 
multiple syncs that must occur. 


Results 


Our approach produces results in a number of ar- 
eas that are difficult to quantify: improved au- 
tomation reduces the amount of time we spend in- 
stalling patches; not installing patches immediately 
improves patch quality and reduces the likelihood of 
flawed patches [2]; and increased compartmentaliza- 
tion makes it easier for our diverse teams to work 
to different purposes without stepping on toes. But 
it also provides testable, quantifiable improvements: 
since replacing a manual update process with Pulp 
and Bcfg2’s automated update process, we can see 
that the number of available updates has decreased 
and remained low on the machines using Pulp. 
The practice of staging package deployment 
makes it difficult to quantify just how out of date 
a client is, as yum on the client will only report the 
number of updates available from the repositories in 
yum.conf. To find the number of updates available 
from upstream, we collect an aggregate of all the 
package differences starting at the client and going 
up the hierarchy to the upstream repository. E.g., 
for a machine using the unstable tier, we calculate 
the number of updates available on the machine it- 
self, and then the number of updates available to the 
unstable tier from the live tier. 
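
The aggregation just described can be sketched in Python as a sum of per-level differences. This is a simplified model under stated assumptions: each repository (and the client itself) is a mapping from package name to one version string, a differing version is counted as one pending update without ordering versions, and the function names are hypothetical.

```python
def pending_updates(lower, upper):
    """Count packages present at both levels whose versions differ."""
    return sum(
        1
        for name, version in lower.items()
        if name in upper and upper[name] != version
    )


def updates_behind_upstream(installed, tiers):
    """Sum the differences from the client up through each tier.

    installed: packages on the client.
    tiers: repositories ordered from the client's own tier up to live.
    """
    total = 0
    lower = installed
    for repo in tiers:
        total += pending_updates(lower, repo)
        lower = repo
    return total


if __name__ == "__main__":
    client = {"bash": "4.1-2", "kernel": "2.6.32-100"}
    unstable = {"bash": "4.1-3", "kernel": "2.6.32-100"}
    live = {"bash": "4.1-3", "kernel": "2.6.32-200"}
    print(updates_behind_upstream(client, [unstable, live]))
```

Because only names present at both levels are compared, a package that is renamed or split between levels contributes nothing to the repository comparison.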


The caveat to this approach is when, for instance, 
a package splits into two new packages. This results 
in two new packages, and one missing package, total- 
ing three “updates” according to yum check-update, 
or zero “updates” when comparing repositories them- 
selves, when in reality it is a single package update. 
For example, if package foo receives an update that 
results in packages foo-client and foo-server, this 
could result in a margin of error of -1 or +2. This 
gives a slight potential benefit to machines using Pulp 
in our metrics, as updates of this sort are underesti- 
mated when calculating the difference between repos- 
itories, but overestimated when using yum to report 
on updates available to a machine. In practice, this is 
extremely rare, though, and should not significantly 
affect the results. 





Ensuring, with a high degree of confidence, that 
updates are installed is wonderful, but even more 
important is ensuring that vulnerabilities are being 
mitigated. Using the data from monthly Nessus [11] 
vulnerability scans, we can see that machines using 
Pulp do indeed reap the benefits of being patched 
with more frequency: 
[Figure: Vulnerabilities found by Nessus scans, comparing servers using Pulp with servers not using Pulp.] 


This graph is artificially skewed against Pulp due 
to the sorts of things Nessus scans for; for instance, 
web servers are more likely to be using Pulp at this 
time simply due to our implementation plan, and 
they also have disproportionately more vulnerabili- 
ties in Nessus because they have more services ex- 
posed. 


Future Development 


Sponge 


At this time, Pulp is very early code; it has been in 
use in another Red Hat product for a while, so certain 
paths are well-tested, but other paths are pre-alpha. 
Consequently, its command line interface lacks pol- 
ish, and many tasks within Pulp require extraordi- 
nary verbosity to accomplish. It is also not clear if 
Pulp is intended for standalone use, although such is 
possible. 

To ease management of Pulp, we have written a 
web frontend for management of Pulp and its objects, 
called “Sponge.” Sponge, powered by the Django [4] 
web framework, provides views into the state of Pulp 
repositories along with the ability to manage their 
contents. Sponge leverages Pulp’s Python client API to 
provide convenience functions that ease our workflow. 

By presenting the information visually, Sponge 
makes repository management much more intuitive. 
Sponge extends the functionality of Pulp by display- 
ing the differences between a repository and its parent 
in the form of a diff. These diffs give greater insight 
into exactly how stable, unstable, and live tiers 
differ. They also provide insight into the implications 
of a package promotion or removal. 
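
Since Sponge is unreleased, the following Python sketch only illustrates the kind of comparison such a diff involves; it is not Sponge's code. Repositories are modeled as mappings from package name to one version string, and the function name and result keys are hypothetical.

```python
def repo_diff(child, parent):
    """Describe how a repository differs from its parent.

    Returns package names found only in the parent, names found only
    in the child, and (name, child_version, parent_version) tuples for
    packages present in both at different versions.
    """
    child_names = set(child)
    parent_names = set(parent)
    return {
        "only_in_parent": sorted(parent_names - child_names),
        "only_in_child": sorted(child_names - parent_names),
        "changed": sorted(
            (name, child[name], parent[name])
            for name in child_names & parent_names
            if child[name] != parent[name]
        ),
    }


if __name__ == "__main__":
    unstable = {"kernel": "2.6.32-100", "bash": "4.1-3", "local-tool": "1.0-1"}
    live = {"kernel": "2.6.32-200", "bash": "4.1-3", "kernel-devel": "2.6.32-200"}
    print(repo_diff(unstable, live))
```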

This is particularly important with package removal, 
since, as noted, removing a package will also remove 
anything that requires that specific package. Without 
Sponge’s diff feature and a confirmation step, that is 
potentially very dangerous; Pulp itself only reports 
the packages it has removed, without offering an 
opportunity to confirm or reject the removal. The 
converse situation (promoting a package pulls in 
unintended dependencies) is also potentially dangerous, 
albeit less so. Sponge helps avert both dangers. 





Guaranteeing a minimum package age 


As Beattie et al. observe [2], the optimal time to 
apply security patches is either 10 or 30 days after the 
patches have been released. Our workflow currently 
doesn’t provide any way to guarantee this; our weekly 
manual promotion of new packages merely suggests 
that a patch be somewhere between 0 and 6 days old 
before it is promoted to unstable, and 7 and 13 days 
old before being promoted to stable. We plan to add 
a feature — either to Sponge or to Pulp — to promote 
packages only once they have aged properly. 
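
A minimal Python sketch of what such an age check might look like. The feature does not exist yet, so everything here is an assumption: the function names, the idea of recording an upstream release date per package, and the default of 10 days (taken from the shorter of the two periods cited above).

```python
from datetime import date, timedelta


def aged_enough(released, today, minimum_days=10):
    """True once a package has been out for at least minimum_days."""
    return today - released >= timedelta(days=minimum_days)


def promotable(release_dates, today, minimum_days=10):
    """Names of packages old enough to promote.

    release_dates: dict mapping package name -> upstream release date.
    """
    return sorted(
        name
        for name, released in release_dates.items()
        if aged_enough(released, today, minimum_days)
    )


if __name__ == "__main__":
    releases = {
        "bash": date(2011, 12, 1),
        "httpd": date(2011, 12, 5),
    }
    print(promotable(releases, today=date(2011, 12, 11)))
```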





Other packaging formats 


In this paper we have dealt with systems using yum 
and RPM, but the approach can, at least in theory, be 
expanded to other packaging systems. Pulp intends 
eventually to support not only Debian packages, but 
actually any sort of generic content at all [6], mak- 
ing it useful for any packaging system. Bcfg2, for 
its part, already has package drivers for a wide array 
of packaging systems, including APT, Solaris pack- 
ages (Blastwave- or SystemV-style), Encap, FreeBSD 
packages, IPS, MacPorts, Pacman, and Portage. 
This gives a hint of the future potential for this ap- 
proach. 


Availability 


Most of the software involved in the approach discussed 
in this paper is free and open source. The 
various elements of our solution can be found at: 


Pulp http://pulpproject.org 


Bcfg2 http://trac.mcs.anl.gov/projects/bcfg2 


Yum http://yum.baseurl.org/ 
Sponge, the web UI to Pulp listed in the Future 
Development section, is currently incomplete and un- 
released. We have already worked closely with the 
Pulp developers to incorporate features into the Pulp 
core itself, and we will continue to do so. We hope 
that Sponge will become unnecessary as Pulp ma- 
tures. 
Notes 


1 For instance, Extra Packages for Enterprise Linux (EPEL) 
and the CentOS repositories themselves. 

2 Admittedly, this is a non-standard naming scheme, but 
no solution can be predicated on the idea that all RPMs are 
well-built. 

3 The command in question merely maintains a local file on 
a machine, so that file would still have to be copied into the 
Bcfg2 specification, but we believe this would be less 
error-prone than copying package version details. 

4 This is actually only true if the package is being added 
from another repository; it is possible to add a package 
directly from the filesystem, in which case dependency checking 
is not performed. This is not a use case for us, though. 

5 In Pulp, filters can only be applied to repositories with 
local feeds. 

6 This may change in future versions of Pulp, as multiple 
users, ourselves included, have asked for stronger grouping 
functionality [5]. 

7 As noted earlier, in Pulp, filters can only be applied to 
repositories with local feeds, so no filter mediates the sync 
between upstream and live. 

8 It is lost on many vendors that it is unreasonable and foolish 
to require a specific RHEL minor release. As much work 
as has gone into this solution, it is still less than would be 
required to convince most vendors of this fact, though. 

9 Technically, filters can be applied on a per-repository basis, 
so black- and whitelists can be applied to individual 
repositories. This is very rare in our workflow, though. 

10 It is true that our approach does not guarantee consistency. 
A repository sync might result in an inconsistency if a package 
that was not listed on that sync’s blacklist required a package 
that was listed on the blacklist. In practice this can be limited 
by using regular expressions to filter families of packages (e.g., 
^mysql.* or ^(.*-)?mysql.* to blacklist all MySQL-related 
packages rather than just blacklisting the mysql-server 
package itself). 

11 Unfortunately long-term data was not available for vul- 
nerabilities for a number of reasons: CentOS 5 stopped ship- 
ping updates in their mainline repositories between July 21st 
and September 14th; the August security scan was partially 
skipped; and Pulp hasn’t been in production long enough to 
get meaningful numbers prior to that. Still, the snapshot of 
data is compelling. 
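
To illustrate the family-wide blacklisting mentioned in note 10, the sketch below applies an anchored regular expression to package names. It is not Pulp's filter implementation: the helper names are hypothetical, and the pattern ^(.*-)?mysql.* is assumed to be matched against the bare package name.

```python
import re

# Assumed pattern, modeled on the example in note 10.
MYSQL_FAMILY = re.compile(r"^(.*-)?mysql.*")


def is_blacklisted(package_name, patterns):
    """True if any blacklist pattern matches the start of the package name."""
    return any(p.match(package_name) for p in patterns)


def filter_sync(package_names, patterns):
    """Return the package names that survive the blacklist, in original order."""
    return [n for n in package_names if not is_blacklisted(n, patterns)]
```

Because the optional group `(.*-)?` accepts any hyphen-terminated prefix, names such as php-mysql are caught along with mysql-server, which is what reduces the chance of syncing a package whose dependency was blacklisted.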
Abstract 


There is a huge ecosystem of free software for Linux, but 
since each Linux distribution (distro) contains a differ- 
ent set of pre-installed shared libraries, filesystem layout 
conventions, and other environmental state, it is difficult 
to create and distribute software that works without has- 
sle across all distros. Online forums and mailing lists 
are filled with discussions of users’ troubles with com- 
piling, installing, and configuring Linux software and 
their myriad of dependencies. To address this ubiqui- 
tous problem, we have created an open-source tool called 
CDE that automatically packages up the Code, Data, and 
Environment required to run a set of x86-Linux pro- 
grams on other x86-Linux machines. Creating a CDE 
package is as simple as running the target application un- 
der CDE’s monitoring, and executing a CDE package re- 
quires no installation, configuration, or root permissions. 
CDE enables Linux users to instantly run any application 
on-demand without encountering “dependency hell”. 


1 Introduction 


The simple-sounding task of taking software that runs on 
one person’s machine and getting it to run on another 
machine can be painfully difficult in practice. Since no 
two machines are identically configured, it is hard for 
developers to predict the exact versions of software and 
libraries already installed on potential users’ machines 
and whether those conflict with the requirements of their 
own software. Thus, software companies devote con- 
siderable resources to creating and testing one-click in- 
stallers for products like Microsoft Office, Adobe Pho- 
toshop, and Google Chrome. Similarly, open-source de- 
velopers must carefully specify the proper dependencies 
in order to integrate their software into package manage- 
ment systems [4] (e.g., RPM on Linux, MacPorts on Mac 
OS X). Despite these efforts, online forums and mail- 
ing lists are still filled with discussions of users’ troubles 


with compiling, installing, and configuring software and 
their myriad of dependencies. For example, the official 
Google Chrome help forum for “install/uninstall issues” 
has over 5800 threads. 

In addition, a study of US labor statistics predicts that 
by 2012, 13 million American workers will do program- 
ming in their jobs, but amongst those, only 3 million will 
be professional software developers [24]. Thus, there are 
potentially millions of people who still need to get their 
software to run on other machines but who are unlikely 
to invest the effort to create one-click installers or wres- 
tle with package managers, since their primary job is not 
to release production-quality software. For example: 


• System administrators often hack together ad-hoc 
utilities comprised of shell scripts and custom-compiled 
versions of open-source software, in order 
to perform system monitoring and maintenance 
tasks. Sysadmins want to share their custom-built 
tools with colleagues, quickly deploy them to other 
machines within their organization, and “future-proof” 
their scripts so that they can continue 
functioning even as the OS inevitably gets upgraded. 


• Research scientists often want to deploy their 
computational experiments to a cluster for greater 
performance and parallelism, but they might not have 
permission from the sysadmin to install the required 
libraries on the cluster machines. They also want to 
allow colleagues to run their research code in order 
to reproduce and extend their experiments. 


• Software prototype designers often want clients to 
be able to execute their prototypes without the hassle 
of installing dependencies, in order to receive 
continual feedback throughout the design process. 


Figure 1: CDE enables users to package up any Linux 
application and deploy it to all modern Linux distros. 


In this paper, we present an open-source tool called 
CDE [1] that makes it easy for people of all levels of 
IT expertise to get their software running on other 
machines without the hassle of manually creating a robust 
installer or dealing with user complaints about 
dependencies. CDE automatically packages up the Code, Data, 
and Environment required to run a set of x86-Linux 
programs on other x86-Linux machines without any 
installation (see Figure 1). To use CDE, the user simply: 


1. Prepends any set of Linux commands with the cde 
executable. cde executes the commands and uses 
ptrace system call interposition to collect all the 
code, data files, and environment variables used 
during execution into a self-contained package. 


2. Copies the resulting CDE package to an x86-Linux 
machine running any distro from the past ~5 years. 


3. Prepends the original packaged commands with the 
cde-exec executable to run them on the target 
machine. cde-exec uses ptrace to redirect file- 
related system calls so that executables can load 
the required dependencies from within the package. 
Execution can range from ~0% to ~30% slower. 


The main benefits of CDE are that creating a package 
is as easy as executing the target program under its 
supervision, and that running a program within a package 
requires no installation, configuration, or root permissions. 

The design philosophy underlying CDE is that people 
should be able to package up their Linux software and 
deploy it to other Linux machines with as little effort as 
possible. However, CDE is not meant to replace tradi- 
tional installers or package managers; its intended role is 
to serve as a convenient ad-hoc solution for people like 
sysadmins, research scientists, and prototype makers. 

Figure 2: CDE’s streaming mode enables users to run any 
Linux application on-demand by fetching the required 
files from a farm of pre-installed distros in the cloud. 


Since its release in Nov. 2010, CDE has been downloaded 
over 3,000 times [1]. We have exchanged hundreds 
of emails with users throughout both academia and 
industry. In the past year, we have made several significant 
enhancements to the base CDE system in response to 
user feedback. Although we introduced an early version 
of CDE in a short paper [20], this paper presents a more 
complete CDE system with three new features: 


• To overcome CDE’s primary limitation of only being 
able to package dependencies collected on executed 
paths, we introduce new tools and heuristics 
for making CDE packages complete (Section 3). 


• To make CDE-packaged programs behave just like 
native applications on the target machine rather than 
executing in an isolated sandbox, we introduce a 
new seamless execution mode (Section 4). 


• Finally, to enable users to run any Linux application 
on-demand, we introduce a new application streaming 
mode (Section 5). Figure 2 shows its high-level 
architecture: The system administrator first installs 
multiple versions of many popular Linux distros in 
a “distro farm” in the cloud (or an internal compute 
cluster). The user connects to that distro farm 
via an ssh-based protocol from any x86-Linux machine. 
The user can now run any application available 
within the package managers of any of the distros 
in the farm. CDE’s streaming mode fetches the 
required files on-demand, caches them locally on 
the user’s machine, and creates a portable 
distro-independent execution environment. Thus, Linux 
users can instantly run the hundreds of thousands of 
applications already available in the package managers 
of all distros without being forced to use one 
specific release of one specific distro¹. 


This paper continues with descriptions of real-world 
use cases (Section 6), evaluations of portability and per- 
formance (Section 7), comparisons to related work (Sec- 
tion 8), and concludes with discussions of design philos- 
ophy, limitations, and lessons learned (Section 9). 


¹ The package managers included in different releases of the same 
Linux distro often contain incompatible versions of many applications! 
Figure 3: Example use of CDE: 1.) Alice runs her command 
with cde to create a package, 2.) Alice sends her 
package to Bob’s computer, 3.) Bob runs command with 
cde-exec, which redirects file accesses into package. 


2 CDE system overview 


We described the details of CDE’s design and implemen- 
tation in a prior paper and its accompanying technical 
report [20]. We will now summarize the core features of 
CDE using an example. 

Suppose that Alice is a system administrator who is 
developing a Python script to detect anomalies in net- 
work log files. She normally runs her script using this 
Linux command: 


python detect_anomalies.py net.log 


Suppose that Alice’s script (detect_anomalies.py) 
imports some 3rd-party Python extension modules, 
which consist of optimized C++ log parsing code com- 
piled into shared libraries. If Alice wants her colleague 
Bob to be able to run her analysis, then it is not sufficient 
to just send her script and net.log data file to him. 

Even if Bob has a compatible version of Python on his 
Linux machine, he will not be able to run her script until 
he compiles, installs, and configures the exact extension 
modules that her script used (and all of their transitive 
dependencies). Since Bob is probably using a different 
Linux distribution (distro) than Alice, even if Alice pre- 
cisely recalled all of the steps involved in installing all of 
the original dependencies on her machine, those instruc- 
tions probably will not work on Bob’s machine. 


[Figure 4 diagram; recoverable labels: program, kernel, open file, copy file into package] 


Figure 4: Timeline of control flow between target program, 
kernel, and cde process during an open syscall. 


2.1 Creating a new CDE package 


To create a self-contained package with all of the depen- 
dencies required to run her anomaly detection script on 
another Linux machine, Alice simply prepends her com- 
mand with the cde executable: 


cde python detect_anomalies.py net.log 


cde runs her command normally and uses the Linux 
ptrace system call to monitor all of the files it ac- 
cesses throughout execution. cde creates a new sub- 
directory called cde-package/cde-root/ and copies 
all of those accessed files into there, mirroring the orig- 
inal directory structure. Figure 4 shows an overview of 
the control flow between the target program, Linux ker- 
nel, and cde during a file-related system call. 

For example, if Alice’s script dynamically 
loads an extension module as a shared library 
named /usr/lib/logutils.so (i.e., log parsing 
utility code), then cde will copy it to 
cde-package/cde-root/usr/lib/logutils.so 
(see Figure 3). cde also saves the values of environment 
variables in a text file within cde-package/. 

When execution terminates, the cde-package/ 
sub-directory (which we call a “CDE package”) contains all 
of the files required to run Alice’s original command. 
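
The mirroring step described above can be sketched as follows. This is a minimal illustration rather than CDE's implementation: `package_path` and `copy_into_package` are hypothetical names, the accessed path is assumed to be already known (CDE obtains it by monitoring system calls with ptrace), and symlinks are simply followed by the copy.

```python
import os
import shutil


def package_path(accessed_path, cde_root):
    """Map an absolute path to its mirrored location under cde_root."""
    return os.path.join(cde_root, accessed_path.lstrip("/"))


def copy_into_package(accessed_path, cde_root):
    """Copy one accessed file into the package, recreating its directories."""
    dest = package_path(accessed_path, cde_root)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copy2(accessed_path, dest)  # follows symlinks; structure is not preserved
    return dest
```

For instance, `package_path("/usr/lib/logutils.so", "cde-package/cde-root")` yields `cde-package/cde-root/usr/lib/logutils.so`, matching the example in the text.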


2.2 Executing a CDE package 


Alice zips up the cde-package/ directory and transfers 
it to Bob’s Linux machine. Now Bob can run Alice’s 
anomaly detection script without first installing anything 
on his machine. To do so, he unzips the package, changes 
into the sub-directory containing the script, and prepends 
her original command with the cde-exec executable 
(also included in the package): 


cde-exec python detect_anomalies.py net.log 


cde-exec sets up the environment variables saved 
from Alice’s machine and executes the versions of 
python and its extension modules that are located within 
the package. cde-exec uses ptrace to monitor all 
system calls that access files and dynamically rewrites 
their path arguments to the corresponding paths within 
the cde-package/cde-root/ sub-directory. Figure 5 
shows the control flow between the target program, 
kernel, and cde-exec during a file-related system call. 


Figure 5: Timeline of control flow between target program, 
kernel, and cde-exec during an open syscall. 

For example, when her script requests to load the 
/usr/lib/logutils.so library using an open system 
call, cde-exec rewrites the path argument of 
the open call to 
cde-package/cde-root/usr/lib/logutils.so (see Figure 3). 
This run-time path redirection is essential, because 
/usr/lib/logutils.so probably does not exist on Bob’s machine. 


2.3 CDE package portability 


Alice’s CDE package can execute on any Linux machine 
with an architecture and kernel version that are 
compatible with its constituent binaries. CDE currently 
works on 32-bit and 64-bit variants of the x86 
architecture (i386 and x86-64, respectively). In general, a 
32-bit cde-exec can execute 32-bit packaged applications 
on 32- and 64-bit machines. A 64-bit cde-exec 
can execute both 32-bit and 64-bit packaged applications 
on a 64-bit machine. Extending CDE to other architectures 
(e.g., ARM) is straightforward because the strace 
tool that CDE is built upon already works on many 
architectures. However, CDE packages cannot be transported 
across architectures without using a CPU emulator. 

Our portability experiments (§7.1) show that packages 
are portable across Linux distros released within 5 
years of the distro where the package originated. Besides 
sharing with colleagues like Bob, Alice can also deploy 
her package to run on a cluster for more computational 
power or to a public-facing server machine for real-time 
online monitoring. Since she does not need to install any- 
thing as root, she does not risk perturbing existing soft- 
ware on those machines. Also, having her script and all 
of its dependencies (including the Python interpreter and 
extension modules) encapsulated within a CDE package 
makes it somewhat “future-proof” and likely to continue 
working on her machine even when its version of Python 
and associated extensions are upgraded in the future. 
Figure 6: The result of copying a file named 
/usr/bin/java into the cde-root/ directory. 


3 Semi-automated package completion 


CDE’s primary limitation is that it can only package up 
files accessed on executed program paths. Thus, pro- 
grams run from within a CDE package will fail when exe- 
cuting paths that access new files (e.g., libraries, configu- 
ration files) that the original execution(s) did not access. 
Unfortunately, no automatic tool (static or dynamic) 
can find and package up all the files required to suc- 
cessfully execute all possible program paths, since that 
problem is undecidable in general. Similarly, it is also 
impossible to automatically quantify how “complete” a 
CDE package is or determine what files are missing, 
since every file-related system call instruction could be 
invoked with complex or non-deterministic arguments. 
For example, the Python interpreter executable has only 
one dlopen call site for dynamically loading extension 
modules, but that dlopen could be called many times 
with different dynamically-generated string arguments 
derived from script variables or configuration files. 
There are two ways to cope with this package incom- 
pleteness problem. First, if the user executes additional 
program paths, then CDE will add new files into the same 
cde-package/ directory. However, making repeated 
executions can get tedious, and it is unclear how many 
or which paths are necessary to complete the package². 
Another way to make CDE packages more complete 
is by manually copying additional files and 
sub-directories into cde-package/cde-root/. For example, 
while executing a Python script, CDE might 
automatically copy the few Python standard library files 
it accesses into, say, cde-package/cde-root/usr/lib/python/. 
To complete the package, the user 
could copy the entire /usr/lib/python/ directory 
into cde-package/cde-root/ so that all Python 
libraries are present. A user can usually make his/her 
package complete by copying only a few crucial 
directories into the package, since programs store all of their 
files in several top-level directories (see Section 3.3). 
However, programs also depend on shared libraries 
that reside in system-wide directories like /lib and 
/usr/lib. Copying all the contents of those directo- 
ries into a package results in lots of wasted disk space. 
In Section 3.2, we present an automatic heuristic tech- 
nique that finds nearly all shared libraries that a program 
requires and copies them into the package. 


² Similar to trying to achieve 100% coverage during software testing. 
Figure 7: The result of using OKAPI to deep-copy a single /usr/bin/java file into cde-root/, preserving the 
exact symlink structure from the original directory tree. Boxes are directories (solid arrows point to their contents), 
diamonds are symlinks (dashed arrows point to their targets), and the bold ellipse is the actual java executable file. 


3.1 The OKAPI utility for deep file copying 


Before describing our heuristics for completing CDE 
packages, we first introduce a utility library we built 
called OKAPI (pronounced “oh-copy”), which performs 
detailed copying of files, directories, and symlinks. 
OKAPI does one seemingly-simple task that turns out to 
be tricky in practice: copying a filesystem entity (i.e., 
a file, directory, or symlink) from one directory to an- 
other while fully preserving its original sub-directory and 
symlink structure (a process that we call deep-copying). 
CDE uses OKAPI to copy files into the cde-root/ sub- 
directory when creating a new package, and the support 
scripts of Sections 3.2 and 3.3 also use OKAPI. 


For example, suppose that CDE needs to copy the 
/usr/bin/java executable file into cde-root/ when 
it is packaging a Java application. The straightforward 
way to do this is to use the standard mkdir and cp 
utilities. Figure 6 shows the resulting sub-directory structure 
within cde-root/, with the boxes representing 
directories and the bold ellipse representing the copy of the 
java executable file located at cde-root/usr/bin/java. 
However, it turns out that if CDE were to use 
this straightforward copying method, the Java 
application would fail to run from within the CDE package! This 
failure occurs because the java executable introspects 
its own path and uses it as the search path for finding 
the Java standard libraries. On our Fedora Core 9 
machine, the Java standard libraries are actually installed 
in /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0, 
so when java reads its own path as /usr/bin/java, it 
cannot possibly use that path to find its standard libraries. 


In order for Java applications to properly run from 
within CDE packages, all of their constituent files must 
be “deep-copied” into the package while replicating 
their original sub-directory and symlink structures. 
Figure 7 illustrates the complexity of deep-copying a single 
file, /usr/bin/java, into cde-root/. The diamond-shaped 
nodes represent symlinks, and the dashed arrows 
point to their targets. Notice how /usr/bin/java is a 
symlink to /etc/alternatives/java, which is itself 
a symlink to /usr/lib/jvm/jre-1.6.0-openjdk/bin/java. 
Another complicating factor is that 
/usr/lib/jvm/jre-1.6.0-openjdk is itself a symlink 
to the /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0/jre/ 
directory, so the actual java executable 
resides in /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0/jre/bin/. 
Java can only find its standard libraries 
when these paths are all faithfully replicated 
within the CDE package. 
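
To make the chain of links in this example concrete, the sketch below lists every path visited when following a file's symlinks. It is an illustration only, not OKAPI's code: `symlink_chain` is a hypothetical helper, and it follows only symlinks in the final path component, so a symlinked parent directory (such as the jre-1.6.0-openjdk directory above) would need additional handling.

```python
import os


def symlink_chain(path, max_hops=40):
    """Follow a path's final-component symlinks and return every path visited."""
    chain = [path]
    while os.path.islink(path) and len(chain) <= max_hops:
        target = os.readlink(path)
        # A relative target is interpreted relative to the symlink's directory;
        # an absolute target replaces the path entirely.
        path = os.path.normpath(os.path.join(os.path.dirname(path), target))
        chain.append(path)
    return chain
```

A deep copy would have to recreate each entry of this list inside the package, not just the last one; the `max_hops` bound is an assumption added here to stop on cyclic links.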

The OKAPI utility library automatically performs the 
deep-copying required to generate the filesystem struc- 
ture of Figure 7. Its interface is as simple as ordinary cp: 
The caller simply requests for a path to be copied into a 
target directory, and OKAPI faithfully replicates the sub- 
directory and symlink structure. 

OKAPI performs one additional task: rewriting the 
contents of symlinks to transform absolute path targets 
into relative path targets within the destination directory 
(e.g., cde-root/). In our example, /usr/bin/java 
is a symlink to /etc/alternatives/java. However, 
OKAPI cannot simply create the cde-root/usr/bin/java 
symlink to also point to /etc/alternatives/java, 
since that target path is outside of cde-root/. 
Instead, OKAPI must rewrite the symlink target so that 
it actually refers to ../../etc/alternatives/java, 
which is a relative path that points to 
cde-root/etc/alternatives/java. 
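
The rewriting rule can be sketched with the standard library's path functions. This is not OKAPI's implementation; `relative_symlink_target` is a hypothetical helper that assumes both the link and its target are given as paths in the original filesystem.

```python
import os


def relative_symlink_target(link_path, target):
    """Rewrite an absolute symlink target relative to the link's own directory.

    Relative targets are returned unchanged, since they already stay valid
    when the whole tree is mirrored under another root.
    """
    if not os.path.isabs(target):
        return target
    return os.path.relpath(target, start=os.path.dirname(link_path))
```

For the example in the text, `relative_symlink_target("/usr/bin/java", "/etc/alternatives/java")` gives `../../etc/alternatives/java`. Because the package mirrors the original directory layout, the same relative target resolves correctly from cde-root/usr/bin/.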

The details of this particular example are not impor- 
tant, but the high-level message that Figure 7 conveys 
is that deep-copying even a single file can lead to the 
creation of over a dozen sub-directories and (possibly- 
rewritten) symlinks. The problem that OKAPI solves is 
not Java-specific; we have observed that many real-world 
Linux applications fail to run from within CDE packages 
unless their files are deep-copied in this detailed way. 

OKAPI is also available as a free standalone command-line 
tool [1]. To our knowledge, no other Linux file copying 
tool (e.g., cp, rsync) can perform the deep-copying 
and symlink rewriting that OKAPI does. 
3.2 Heuristics for copying shared libraries 


When Linux starts executing a dynamically-linked 
executable, the dynamic linker (e.g., ld-linux*.so*) 
finds and loads all shared libraries that are listed in a 
special .dynamic section within the executable file. Running 
the ldd command on the executable shows these 
start-up library dependencies. When CDE is executing a 
target program to create a package, CDE finds all of these 
dependencies as well because they are loaded at start-up 
time via open system calls. 

However, programs sometimes load shared libraries in 
the middle of execution using, say, the dlopen function. 
This run-time loading occurs mostly in GUI programs 
with a plug-in or extension architecture. For example, 
when the user instructs Firefox to visit a web page with 
a Flash animation, Firefox will use dlopen to load the 
Adobe Flash Player shared library. ldd will not find that 
dependency since it is not hard-coded in the .dynamic 
section of the Firefox executable, and CDE will only 
find that dependency if the user actually visits a Flash- 
enabled web page while creating a package for Firefox. 

We have created a simple heuristic-based script that 
finds most or all shared libraries that a program requires³. 
The user first creates a base CDE package by executing 
the target program once (or a few times) and then runs 
our script, which works as follows: 


1. Find all ELF binaries (executables and shared 
libraries) within the package using the Linux find 
and file utilities. 


2. For each binary, find all constant strings using the 
strings utility, and look for strings containing 
“so” since those are likely to be shared libraries. 


3. Call the locate utility on each candidate shared li- 
brary string, which returns the full absolute paths of 
all installed shared libraries that match each string. 


4. Use OKAPI to copy each library into the package. 


5. Repeat this process until no new libraries are found. 
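
Steps 2 and 5 above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' script: the function names are hypothetical, printable-string extraction is a simplified stand-in for the strings utility, candidates are assumed to be strings containing ".so", and the lookup of installed libraries (step 3, done with locate) is passed in as a callback.

```python
import re

# Runs of at least four printable ASCII bytes, roughly what strings(1) reports.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def constant_strings(data):
    """Return the printable ASCII strings embedded in a binary's bytes."""
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)]


def candidate_libraries(data):
    """Step 2: keep strings that look like shared library names."""
    return sorted({s for s in constant_strings(data) if ".so" in s})


def expand_until_stable(initial, libraries_named_in):
    """Step 5: repeat the search on newly found libraries until nothing new appears.

    libraries_named_in(binary) must return the libraries that a binary refers to.
    """
    found = set(initial)
    pending = list(initial)
    while pending:
        binary = pending.pop()
        for lib in libraries_named_in(binary):
            if lib not in found:
                found.add(lib)
                pending.append(lib)
    return found
```

The `found` set guarantees termination even when libraries refer to each other, since each library is examined at most once.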


This heuristic technique works well in practice be- 
cause programs often list all of their dependent shared 
libraries in string constants within their binaries. The 
main exception occurs in dynamic languages like Python 
or MATLAB, whose programs often dynamically gener- 
ate shared library paths based on the contents of scripts 
and configuration files. 

Another limitation of this technique is that it is overly 
conservative and can create larger-than-needed pack- 
ages, since the locate utility can find more libraries 
than the target program actually needs. 


³ Always a superset of the shared libraries that ldd finds. 
3.3 OKAPI-based directory copying script 


In general, running an application once under CDE 
monitoring only packages up a subset of all required files. In 
our experience, the easiest way to make CDE packages 
complete is to copy entire sub-directories into the 
package. To facilitate this process, we created a script that 
repeatedly calls OKAPI to copy an entire directory at a 
time into cde-root/, automatically following symlinks 
to other directories and recursively copying as needed. 

Although this approach might seem primitive, it is ef- 
fective in practice because applications often store all of 
their files in a few top-level directories. When a user 
inspects the directory structure within cde-root/, it 
is usually obvious where the application’s files reside. 
Thus, the user can run our OKAPI-based script to copy 
the entirety of those directories into the package. 


Evaluation: To demonstrate the efficacy of this ap- 
proach, we have created complete self-contained CDE 
packages for six of the largest and most popular Linux 
applications. For each app, we made an initial packag- 
ing run with cde, inspected the package contents, and 
copied at most three directories into the package. The 
entire packaging process took several minutes of human 
effort per application. Here are our full results: 


• AbiWord is a free alternative to Microsoft Word. 
After an initial packaging run, we saw that some 
plug-ins were included in the 
cde-root/usr/lib/abiword-2.8/plugins and 
cde-root/usr/lib/goffice/0.8.1/plugins directories. 
Thus, we copied the entirety of those two original 
directories into cde-root/ to complete its package, 
thereby including all AbiWord plug-ins. 


• Eclipse is a sophisticated IDE and software development 
platform. We completed its package 
by copying the /usr/lib/eclipse and 
/usr/share/eclipse directories into cde-root/. 


• Firefox is a popular web browser. We completed its 
package by copying /usr/lib/firefox-3.6.18 
and /usr/lib/firefox-addons into 
cde-root/ (plus another directory for the 
third-party Adobe Flash player plug-in). 


• GIMP is a sophisticated graphics editing tool. 
We completed its package by copying 
/usr/lib/gimp/2.0 and /usr/share/gimp/2.0. 


• Google Earth is an interactive 3D mapping application. 
We completed its package by copying 
/opt/google/earth into cde-root/. 


• OpenOffice.org is a free alternative to the Microsoft 
Office productivity suite. We completed its 
package by copying the /usr/lib/openoffice 
directory into cde-root/. 
[Figure 8 diagram; recoverable labels: Alice's CDE package, libpython2.6.so, error log] 


Figure 8: Example filesystem layout on Bob’s machine after he receives a CDE package from Alice (boxes are direc- 
tories, ellipses are files). CDE’s seamless execution mode enables Bob to run Alice’s packaged script on the log files 
in /var/log/httpd/ without first moving those files inside of cde-root/. 


4 Seamless execution mode 


When executing a program from within a package, 
cde-exec redirects all file accesses into the package 
by default, thereby creating a chroot-like sandbox with 
cde-package/cde-root/ as the pseudo-root direc- 
tory (see Figure 3, Step 3). However, unlike chroot, CDE 
does not require root access to run, and its sandbox poli- 
cies are flexible and user-customizable [20]. 

This default chroot-like execution mode is fine for run- 
ning self-contained GUI applications like games or web 
browsers, but it is a somewhat awkward way to run most 
types of UNIX-style command-line programs that sys- 
tem administrators, developers, and hackers often prefer. 
If users are running, say, a compiler or command-line im- 
age processing utility from within a CDE package, they 
would need to first move their input data files into the 
package, run the target program using cde-exec, and 
then move the resulting output data files back out of the 
package, which is a cumbersome process. 

In our Alice-and-Bob example from Section 2 (see 
Figure 3), if Bob wants to run Alice’s anomaly detection 
script on his own log data (e.g., bob.log), he 
needs to first move his data file inside of 
cde-package/cde-root/, change into the appropriate sub-directory 
deep within the package, and then run: 


cde-exec python detect_anomalies.py bob.log 


In contrast, if Bob had actually installed the proper 
version of Python and its required extension modules on 
his machine, then he could run Alice’s script from any- 
where on his filesystem with no restrictions. Some CDE 
users wanted CDE-packaged programs to behave just like 
regularly-installed programs rather than requiring input
files to be moved inside of a cde-package/cde-root/
sandbox, so we implemented a new seamless execution
mode that largely achieves this goal.

Seamless execution mode works using a simple 
heuristic: If cde-exec is being invoked from a di- 
rectory not in the CDE package (i.e., from somewhere 
else on the user’s filesystem), then only redirect a path 
into cde-package/cde-root/ if the file that the path 
refers to actually exists within the package. Otherwise 
simply leave the path unmodified so that the program can 
access the file normally. No user intervention is needed 
in the common case. 

The intuition behind why this heuristic works is 
that when programs request to load libraries and other 
mandatory components, those files must exist within the 
package, so their paths are redirected. On the other hand, 
when programs request to load an input file passed via, 
say, a command-line argument, that file does not exist 
within the package, so the original path is used to retrieve 
it from the native filesystem. 

In the example shown in Figure 8, if Bob ran Alice’s 
script to analyze an arbitrary log file on his machine (e.g., 
his web server log, /var/log/httpd/access_log), 
then cde-exec will redirect Python’s request for its own 
libraries (e.g., /lib/libpython2.6.so and
/usr/lib/logutils.so) inside of cde-root/ since those
files exist within the package, but cde-exec will not 
redirect /var/log/httpd/access_log and instead 
load the real file from its original location. 
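
The heuristic described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CDE's actual implementation: the function name resolve_path and its parameters are hypothetical, it handles only absolute paths, and it tests for existence with an ordinary filesystem lookup rather than by intercepting system calls.

```python
import os


def resolve_path(requested, package_dir, invoked_from):
    """Return the path that a packaged program's file access should use.

    Hypothetical helper: `requested` is the absolute path the program asked
    for, `package_dir` is the cde-package/ directory, and `invoked_from` is
    the directory where cde-exec was started.
    """
    package_dir = os.path.abspath(package_dir)
    invoked_from = os.path.abspath(invoked_from)
    cde_root = os.path.join(package_dir, "cde-root")
    # Where the requested file would live if it were inside the package.
    candidate = os.path.join(cde_root, requested.lstrip(os.sep))

    inside_package = (
        os.path.commonpath([invoked_from, package_dir]) == package_dir
    )
    if inside_package:
        # Default chroot-like mode: redirect every access into the package.
        return candidate
    # Seamless mode: redirect only when the file exists in the package.
    return candidate if os.path.exists(candidate) else requested
```

With a package that contains /lib/libpython2.6.so but no /var/log/httpd/access_log, a call made from outside the package returns the packaged library path for the first request and the unmodified native path for the second.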

Figure 9: An example use of CDE’s streaming mode to run Eclipse 3.6 on any Linux machine without installation.
cde-exec fetches all dependencies on-demand from a remote Linux distro and stores them in a local cache.

Seamless execution mode fails when the user
wants the packaged program to access a file from
the native filesystem, but an identically-named
file actually exists within the package. In the
above example, if cde-package/cde-root/var/log/httpd/access_log
existed, then that file
would be processed by the Python script instead of
/var/log/httpd/access_log. There is no automated
way to resolve such name conflicts, but cde-exec
provides a “verbose mode” where it prints out a log
of what paths were redirected within the package.
The user can inspect that log and then manually write
redirection/ignore rules in a configuration file to control
which paths cde-exec redirects into cde-root/. For
instance, the user could tell cde-exec to not redirect
any paths starting with /var/log/httpd/*.
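
The effect of such an ignore rule can be sketched as follows. The rule format used here (a plain list of path prefixes) and the names should_redirect and ignore_prefixes are assumptions made for illustration; the text does not show the syntax of CDE's configuration file, and this sketch does not attempt to reproduce it. The optional log list loosely mimics the verbose-mode record of redirection decisions.

```python
import os


def should_redirect(requested, cde_root, ignore_prefixes, log=None):
    """Decide whether `requested` is redirected into the package.

    Hypothetical helper: a user-supplied ignore prefix always wins over the
    "file exists in the package" heuristic.
    """
    if any(requested.startswith(prefix) for prefix in ignore_prefixes):
        decision = False
    else:
        candidate = os.path.join(cde_root, requested.lstrip(os.sep))
        decision = os.path.exists(candidate)
    if log is not None:
        # Record the decision so that a user can inspect it afterwards.
        log.append((requested, "redirected" if decision else "native"))
    return decision
```

When the package happens to contain its own var/log/httpd/access_log, the call without rules redirects the access, while the same call with the prefix "/var/log/httpd/" leaves the native path in use.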

Using seamless execution mode, our users have been 
able to run software such as programming language in- 
terpreters and compilers, scientific research tools, and 
sysadmin scripts from CDE packages and have them be- 
have just like regularly-installed programs. 


5 On-demand application streaming 


We now introduce a new application streaming mode 
where CDE users can instantly run any Linux application 
on-demand without having to create, transfer, or install 
any packages. Figure 2 shows a high-level architectural 
overview. The basic idea is that a system administra- 
tor first installs multiple versions of many popular Linux 
distros in a “distro farm” in the cloud (or an internal com- 
pute cluster). When a user wants to run some application 
that is available on a particular distro, they use sshfs (an 
ssh-based network filesystem [9]) to mount the root di- 
rectory of that distro into a special cde-remote-root/ 
mountpoint on their Linux machine. Then the user can 
use CDE’s streaming mode to run any application from 
that distro locally on their own machine.


5.1 Implementation and example


Figure 9 shows an example of streaming mode. Let’s say 
that Alice wants to run the Eclipse 3.6 IDE on her Linux 
machine, but the particular distro she is using makes it 
difficult to obtain all the dependencies required to install 
Eclipse 3.6. Rather than suffering through dependency 
hell, Alice can simply connect to a distro in the farm that 
contains Eclipse 3.6 and then use CDE’s streaming mode 
to “harvest” the required dependencies on-demand. 


Alice first mounts the root directory of the re- 
mote distro at cde-remote-root/. Then she 
runs “cde-exec -s eclipse” (-s activates 
streaming mode). cde-exec finds and executes 
cde-remote-root/bin/eclipse. When that exe- 
cutable requests shared libraries, plug-ins, or any other 
files, cde-exec will redirect the respective paths into 
cde-remote-root/, thereby executing the version of 
Eclipse 3.6 that resides in the cloud distro. However, 
note that the application is running locally on Alice’s 
machine, not in the cloud. 


An astute reader will immediately realize that running 
applications in this manner can be slow, since files are be- 
ing accessed from a remote server. While sshfs performs 
some caching, we have found that it does not work well 
enough in practice. Thus, we have implemented our own 
caching layer within CDE: When a remote file is accessed 
from cde-remote-root/, cde-exec uses OKAPI to 
make a deep-copy into a local cde-root/ directory and 
then redirects that file’s path into cde-root/. In streaming
mode, cde-root/ initially starts out empty and then
fills up with a subset of files from cde-remote-root/ 
that the target program has accessed.

To avoid unnecessary filesystem accesses, CDE’s
cache also keeps a list of file paths that the target program 
tried to access from the remote server, even keeping paths 
for non-existent files. On subsequent runs, when the pro- 
gram tries to access one of those paths, cde-exec will 
redirect the path into the local cde-root/ cache. It is 
vital to track non-existent files since programs often try 
to access non-existent files at start-up while doing, say, a 
search for shared libraries by probing a list of directories 
in a search path. If CDE did not track non-existent files, 
then the program would still access the directory entries 
on the remote server before discovering that those files 
still do not exist, thus slowing down performance. 
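
The two cache behaviors described above (copying a remote file on first access, and remembering paths that do not exist) can be modeled with a short sketch. Everything here is illustrative: the class name StreamingCache is hypothetical, shutil.copy2 stands in for the OKAPI deep copy, only regular files are handled, and the set of missing paths lives in memory, whereas CDE keeps its list across runs.

```python
import os
import shutil


class StreamingCache:
    """Toy model of the local cache in front of cde-remote-root/."""

    def __init__(self, remote_root, local_root):
        self.remote_root = remote_root
        self.local_root = local_root
        self.known_missing = set()  # paths already found absent remotely
        self.remote_lookups = 0     # counts the slow remote accesses

    def resolve(self, path):
        """Return a local path for `path`, or None if it does not exist."""
        rel = path.lstrip(os.sep)
        local = os.path.join(self.local_root, rel)
        if os.path.exists(local):
            return local            # cache hit: no remote access
        if path in self.known_missing:
            return None             # negative hit: no remote access
        self.remote_lookups += 1
        remote = os.path.join(self.remote_root, rel)
        if not os.path.isfile(remote):
            self.known_missing.add(path)
            return None
        os.makedirs(os.path.dirname(local), exist_ok=True)
        shutil.copy2(remote, local)  # stand-in for the deep copy
        return local
```

In this model, repeated probes for the same library, whether it exists or not, touch the remote root only once, which is the property the paragraph above argues is vital for start-up performance.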

With this cache in place, the first time an application is 
run, all of its dependencies must be downloaded, which 
could take several seconds to minutes. This one-time de- 
lay is unavoidable. However, subsequent runs simply use 
the files already in the local cache, so they execute at 
regular cde-exec speeds. An added bonus is that even 
running a different application for the first time might 
still result in some cache hits for, say, generic libraries 
like libc, so the entire application does not need to be
downloaded. 

Finally, the package incompleteness problem faced by 
regular CDE (see Section 3) no longer exists in streaming 
mode. When the target application needs to access new 
files that do not yet exist in the local cache (e.g., Alice 
loads a new Eclipse plug-in), those files are transparently 
fetched from the remote server and cached. 


5.2 Synergy with package managers 


Nearly all Linux users are currently running one partic- 
ular distro with one default package manager that they 
use to install software. For instance, Ubuntu users must 
use APT, Fedora users must use YUM, SUSE users must 
use Zypper, Gentoo users must use Portage, etc. More- 
over, different releases of the same distro contain differ- 
ent software package versions, since distro maintainers 
add, upgrade, and delete packages in each new release.*

As long as a piece of software and all of its depen- 
dencies are present within the package manager of the 
exact distro release that a user happens to be using, then 
installation is trivial. However, as soon as even one de- 
pendency cannot be found within the package manager, 
then users must revert to the arduous task of compiling 
from source (or configuring a custom package manager). 

CDE’s streaming mode frees Linux users from this
single-distro restriction and allows them to run software
that is available within the package manager of any distro
in the cloud distro farm. The system administrator is
responsible for setting up the farm and provisioning access
rights (e.g., ssh keys) to users. Then users can directly
install packages in any cloud distro and stream the desired
applications to run locally on their own machines.

* We once tried installing a machine learning application that
depended on the libcv computer vision library. The required libcv
version was found in the APT repository on Ubuntu 10.04, but it
was not found in the repositories on the two immediately neighboring
Ubuntu releases: 9.10 and 10.10.

Philosophically, CDE’s streaming mode maximizes 
user freedom since users are now free to run any appli- 
cation in any package manager from the comfort of their 
own machines, regardless of which distro they choose 
to use. CDE complements traditional package managers 
by leveraging all of the work that the maintainers of 
each distro have already done and opening up access to 
users of all other distros. This synergy can potentially 
eliminate quasi-religious squabbles and flame-wars over 
the virtues of competing distros or package management 
systems. Such fighting is unnecessary since CDE allows 
users to freely choose from amongst all of them. 


6 Real-world use cases 


Since we released the first version of CDE on Novem- 
ber 9, 2010, it has been downloaded at least 3,000 times 
as of September 2011 [1]. We cannot track how many 
people have directly checked out its source code from 
GitHub, though. We have exchanged hundreds of emails 
with CDE users and discovered six salient real-world use 
cases as a result of these discussions. Table 1 shows that 
we used 16 CDE packages, mostly sent in by our users, 
as benchmarks in the experiments reported in Section 7. 
They contain software written in diverse programming 
languages and frameworks. We now summarize the use 
case categories and benchmarks (highlighted in bold). 


Distributing research software: The creators of two 
research tools found CDE online and used it to create 
portable packages that they uploaded to their websites: 

The website for graph-tool, a Python/C++ module 
for analyzing graphs, lists these (direct) dependencies: 
“GCC 4.2 or above, Boost libraries, Python 2.5 or above, 
expat library, NumPy and SciPy Python modules, CGAL
C++ geometry library, and Graphviz with Python bind- 
ings enabled.” [11] Unsurprisingly, lots of people had 
trouble compiling it: 47% of all messages on its mailing 
list (137 out of 289) were questions related to compila- 
tion problems. The author of graph-tool used CDE 
to automatically create a portable package (containing 
149 shared libraries and 1909 total files) and uploaded 
it to his website so that users no longer needed to suffer 
through the pain of manually compiling it. 

arachni, a Ruby-based tool that audits web application
security [10], requires six hard-to-compile Ruby
extension modules, some of which depend on versions
of Ruby and libraries that are not available in the package
managers of most modern Linux distributions. Its
creator, a security researcher, created and uploaded CDE
packages and then sent us a grateful email describing
how much effort CDE saved him: “My guess is that it
would take me half the time of the development process
to create a self-contained package by hand; which would
be an unacceptable and truly scary scenario.”

Package name   Description                                       Dependencies            Creator

Distributing research software
arachni        Web app. security scanner framework [10]          Ruby (+ extensions)     security researcher
graph-tool     Lib. for manipulation & analysis of graphs [11]   Python, C++, Boost      math researcher
pads           Language for processing ad-hoc data [19]          Perl, ML, Lex, Yacc     self
saturn         Static program analysis framework [13]            Perl, ML, Berkeley DB   self

Running production software on incompatible distros
meld           Interactive visual diff and merge tool for text   Python, GTK+            software engineer
bio-menace     Classic video game within a MS-DOS emulator       DOSBox, SDL             game enthusiast
google-earth   3D interactive map application by Google          shell scripts, OpenGL   self

Creating reproducible computational experiments
kpiece         Robot motion planning algorithm [26]              C++, OpenGL             robotics researcher
gadm           Genetic algorithm for social networks [21]        C++, make, R            self

Deploying computations to cluster or cloud
ztopo          Batch processing of topological map images        C++, Qt                 graduate student
klee           Automatic bug finder & test case generator [16]   C++, LLVM, µClibc       self

Submitting executable bug reports
coq-bug-2443   Incorrect output by Coq proof assistant [2]       ML, Coq                 bug reporter
gcc-bug-46651  Causes GCC compiler to segfault [3]               gcc                     bug reporter
llvm-bug-8679  Runs LLVM compiler out of memory [5]              C++, LLVM               bug reporter

Collaborating on class programming projects
email-search   Natural language semantic email search            Python, NLTK, Octave    college student
vr-osg         3D virtual reality modeling of home appliances    C++, OpenSceneGraph     college student

Table 1: CDE packages used as benchmarks in our experiments, grouped by use cases. ‘self’ in the ‘Creator’ column
means package was created by the author; all other packages created by CDE users (mostly people we have never met).

In addition, we used CDE to create portable binary 
packages for two of our Stanford colleagues’ research 
tools, which were originally distributed as tarballs of 
source code: pads [19] and saturn [13]. 44% of 
the messages on the pads mailing list (38 / 87) were 
questions related to troubles with compiling it (22% for 
saturn). Once we successfully compiled these projects 
(after a few hours of improvising our own hacks since the 
instructions were outdated), we created CDE packages by 
running their regression test suites, so that others do not 
need to suffer through the compilation process. 

Even the saturn team leader admitted in a public
email, “As it stands the current release likely has problems
running on newer systems because of bit rot — some
libraries and interfaces have evolved over the past couple
of years in ways incompatible with the release.” [7]
In contrast, our CDE packages are largely immune to “bit
rot” (until the user-kernel ABI changes) because they
contain all required dependencies.


Running software on incompatible distros: Even 
production-quality software might be hard to install on 
Linux distros with older kernel or library versions, espe- 
cially when system upgrades are infeasible. For exam- 
ple, an engineer at Cisco wanted to run some new open- 
source tools on his work machines, but the IT department 
mandated that those machines run an older, more secure 
enterprise Linux distro. He could not install the tools 
on those machines because that older distro did not have 
up-to-date libraries, and he was not allowed to upgrade. 
Therefore, he installed a modern distro at home, ran CDE 
on there to create packages for the tools he wanted to 
port, and then ran the tools from within the packages 
on his work machines. He sent us one of the packages, 
which we used as a benchmark: the meld visual diff tool.

Hobbyists applied CDE in a similar way: A game enthusiast
could only run a classic game (bio-menace)
within a DOS emulator on one of his Linux machines, 
so he used CDE to create a package and can now play the 
game on his other machines. We also helped a user create 
a portable package for the Google Earth 3D map applica- 
tion (google-earth), so he can now run it on older dis- 
tros whose libraries are incompatible with Google Earth. 


Reproducible computational experiments: A funda- 
mental tenet of science is that colleagues should be able 
to reproduce the results of one’s experiments. In the past 
few years, science journals and CS conferences (e.g., 
SIGMOD, FSE) have encouraged authors of published 
papers to put their code and datasets online, so that oth- 
ers can independently re-run, verify, and build upon their 
experiments. However, it can be hard for people to set up 
all of the (often-undocumented) dependencies required 
to re-run experiments. In fact, it can even be difficult 
to re-run one’s own experiments in the future, due to in- 
evitable OS and library upgrades. To ensure that he could 
later re-run and adjust experiments in response to re- 
viewer critiques for a paper submission [16], our group- 
mate Cristian took the hard drive out of his computer at 
paper submission time and archived it in his drawer! 

In our experience, the results of many computational 
science experiments can be reproduced within CDE pack- 
ages since the programs are output-deterministic [15], al- 
ways producing the same outputs (e.g., statistics, graphs) 
for a given input. A robotics researcher used CDE to 
make the experiments for his motion planning paper 
(kpiece) [26] fully-reproducible. Similarly, we helped a 
social networking researcher create a reproducible pack- 
age for his genetic algorithm paper (gadm) [21]. 


Deploying computations to cluster or cloud: People 
working on computational experiments on their desktop 
machines often want to run them on a cluster for greater 
performance and parallelism. However, before they can 
deploy their computations to a cluster or a cloud computing
service (e.g., Amazon EC2), they must first install all of the
required executables and dependent libraries on the clus- 
ter machines. At best, this process is tedious and time- 
consuming; at worst, it can be impossible, since regular 
users often do not have root access on cluster machines. 

A user can create a self-contained package using CDE 
on their desktop machine and then execute that package 
on the cluster or cloud (possibly many instances in par- 
allel), without needing to install any dependencies or to 
get root access on the remote machines. For instance, our 
colleague Peter wanted to use a department-administered 
100-CPU cluster to run a parallel image processing job 
on topological maps (ztopo). However, since he did not 
have root access on those older machines, it was nearly 
impossible for him to install all of the dependencies
required to run his computation, especially the image
processing libraries. Peter used CDE to create a package by
running his job on a small dataset on his desktop, trans- 
ferred the package and the complete dataset to the cluster, 
and then ran 100 instances of it in parallel there. 
Similarly, we worked with lab-mates to use CDE to de- 
ploy the CPU-intensive klee [16] bug finding tool from 
the desktop to Amazon’s EC2 cloud computing service 
without needing to compile Klee on the cloud machines. 
Klee can be hard to compile since it depends on LLVM, 
which is very picky about specific versions of GCC and 
other build tools being present before it will compile. 


Submitting executable bug reports: Bug reporting is 
a tedious manual process: Users submit reports by writing
down the steps for reproduction and the exact versions of
executables and dependent libraries (e.g., “I’m running
Java version 1.6.0_13, Eclipse SDK Version 3.6.1, ...”),
and maybe attaching an input file that triggers the bug.
Developers often have trouble reproducing bugs based 
on these hand-written descriptions and end up closing re- 
ports as “not reproducible.” 

CDE offers an easier and more reliable solution: The 
bug reporter can simply run the command that triggers 
the bug under CDE supervision to create a CDE package, 
send that package to the developer, and the developer can 
re-run that same command on their machine to reproduce 
the bug. The developer can also modify the input file and 
command-line parameters and then re-execute, in order 
to investigate the bug’s root cause. 

To show that this technique works, we asked peo- 
ple who recently reported bugs to popular open-source 
projects to use CDE to create executable bug reports. 
Three volunteers sent us CDE packages, and we were 
able to reproduce all of their bugs: one that causes 
the Coq proof assistant to produce incorrect output 
(coq-bug-2443) [2], one that segfaults the GCC compiler
(gcc-bug-46651) [3], and one that makes the
LLVM compiler allocate an enormous amount of memory
and crash (llvm-bug-8679) [5].

Since CDE is not a record-replay tool, it is not guaranteed
to reproduce non-deterministic bugs. However, at
least it allows the developer to run the exact versions of 
the faulting executables and dependent libraries. 


Collaborating on class programming projects: Two 
users sent us CDE packages they created for collaborat- 
ing on class assignments. Rahul, a Stanford grad student, 
was using NLTK [22], a Python module for natural lan- 
guage processing, to build a semantic email search en- 
gine (email-search) for a machine learning class. De- 
spite much struggle, Rahul’s two teammates were unable 
to install NLTK on their Linux machines due to conflict- 
ing library versions and dependency hell. This meant 
that they could only run one instance of the project at a
time on Rahul’s laptop for query testing and debugging.
When Rahul discovered CDE, he created a package for 
their project and was able to run it on his two teammates’ 
machines, so that all three of them could test and debug 
in parallel. Joshua, an undergrad from Mexico, emailed 
us a similar story about how he used CDE to collaborate 
on and demo his virtual reality class project (vr-osg). 


7 Evaluation 


7.1 Evaluating CDE package portability 


To show that CDE packages can successfully execute on 
a wide range of Linux distros and kernel versions, we 
tested our benchmark packages on popular distros from 
the past 5 years. We installed fresh copies of these dis- 
tros (listed with the versions and release dates of their 
kernels) on a 3GHz Intel Xeon x86-64 machine: 


• Sep 2006 — CentOS 5.5 (Linux 2.6.18)
• Oct 2007 — Fedora Core 8 (Linux 2.6.23)
• Oct 2008 — openSUSE 11.1 (Linux 2.6.27)
• Sep 2009 — Ubuntu 9.10 (Linux 2.6.31)
• Feb 2010 — Mandriva Free Spring (Linux 2.6.33)
• Aug 2010 — Linux Mint 10 (Linux 2.6.35)


We installed 32-bit and 64-bit versions of each distro 
and executed our 32-bit benchmark packages (those cre- 
ated on 32-bit distros) on the 32-bit versions, and our 
64-bit packages on the 64-bit versions. Although all of 
these distros reside on one physical machine, none of our 
benchmark packages were created on that machine: CDE 
users created most of the packages, and we made sure to 
create our own packages on other machines. 


Results: Out of the 96 unique configurations we tested 
(16 CDE packages each run on 6 distros), all executions 
succeeded except for one.* By “succeeded”, we mean
that the programs ran correctly, as far as we could ob- 
serve: Batch programs generated identical outputs across 
distros; regression tests passed; we could interact nor- 
mally with the GUI programs; and we could reproduce 
the symptoms of the executable bug reports. 
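
The bookkeeping behind the 96 configurations can be written out explicitly. The snippet below is only an illustrative tally: the package and distro names come from Table 1 and the list above, the single failure is the one reported in the text, and the variable names are hypothetical.

```python
from itertools import product

PACKAGES = [
    "arachni", "graph-tool", "pads", "saturn", "meld", "bio-menace",
    "google-earth", "kpiece", "gadm", "ztopo", "klee", "coq-bug-2443",
    "gcc-bug-46651", "llvm-bug-8679", "email-search", "vr-osg",
]
DISTROS = [
    "CentOS 5.5", "Fedora Core 8", "openSUSE 11.1",
    "Ubuntu 9.10", "Mandriva Free Spring", "Linux Mint 10",
]
# The single failing configuration reported in the text.
FAILURES = {("vr-osg", "Fedora Core 8")}

# Every package is run on every distro.
configurations = list(product(PACKAGES, DISTROS))
successes = [c for c in configurations if c not in FAILURES]
print(len(configurations), len(successes))
```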

In addition, we were able to successfully execute all 
of our 32-bit packages on the 64-bit versions of CentOS, 
Mandriva, and openSUSE (the other 64-bit distros did 
not support executing 32-bit binaries). 

In sum, we were able to use CDE to successfully execute
a diverse set of programs (Table 1) “out-of-the-box”
on a variety of Linux distributions from the past 5 years,
without performing any installation or configuration.

* vr-osg failed on Fedora Core 8 with a known error related to
graphics drivers.


7.2 Comparing against a one-click installer


To show that the level of portability that CDE enables 
is substantive, we compare CDE against a representative 
one-click installer for a commercial application. We in- 
stalled and ran Google Earth (Version 5.2.1, Sep 2010) 
on our 6 test distros using the official 32-bit installer from 
Google. Here is what happened on each distro: 


• CentOS (Linux 2.6.18) — installs fine but Google
  Earth crashes upon start-up with variants of this
  error message repeated several times, because the
  GNU Standard C++ Library on this OS is too old:

    /usr/lib/libstdc++.so.6: version ‘GLIBCXX_3.4.9’ not found
    (required by ./libgoogleearth_free.so)

• Fedora (Linux 2.6.23) — same error as CentOS

• openSUSE (Linux 2.6.27) — installs and runs fine

• Ubuntu (Linux 2.6.31) — installs and runs fine

• Mandriva (Linux 2.6.33) — installs fine but Google
  Earth crashes upon start-up with this error message
  because a required graphics library is missing:

    error while loading shared libraries: libGL.so.1:
    cannot open shared object file: No such file or directory

• Linux Mint (Linux 2.6.35) — installer program
  crashes with this cryptic error message because the
  XML processing library on this OS is too new and
  thus incompatible with the installer:

    setup.data/setup.xml:1: parser error : Document is empty
    setup.data/setup.xml:1: parser error : Start tag expected, ’<’ not found
    Couldn’t load ’setup.data/setup.xml’


In summary, on 4 out of our 6 test distros, a bi- 
nary installer for the fifth major release of Google Earth 
(v5.2.1), a popular commercial application developed by 
a well-known software company, failed in its sole goal 
of allowing the user to run the application, despite adver- 
tising that it should work on any Linux 2.6 machine. 

If a team of professional Linux developers had this 
much trouble getting a widely-used commercial applica- 
tion to be portable across distros, then it is unreasonable 
to expect researchers or hobbyists to be able to easily 
create portable Linux packages for their prototypes. 

In contrast, once we were able to install Google 
Earth on just one machine (Dell desktop running Ubuntu 
8.04), we ran it under CDE supervision to create a self- 
contained package, copied the package to all 6 test dis- 
tros, and successfully ran Google Earth on all of them 
without any installation or configuration.

                  Native    CDE slowdown
Benchmark         runtime   pack    exec
400.perlbench     23.7s     3.0%    2.5%
401.bzip2         47.3s     0.2%    0.1%
403.gcc           0.93s     2.7%    2.2%
410.bwaves        185.7s    0.2%    0.3%
416.gamess        129.9s    0.1%    0%
429.mcf           16.2s     2.7%    0%
433.milc          15.1s     2%      0.6%
434.zeusmp        36.3s     0%      0%
435.gromacs       133.9s    0.3%    0.1%
436.cactusADM     26.1s     0%      0%
437.leslie3d      136.0s    0.1%    0%
444.namd          13.9s     3%      0.3%
445.gobmk         97.5s     0.4%    0.2%
447.dealII        28.7s     0.5%    0.2%
450.soplex        5.7s      2.2%    1.8%
453.povray        7.8s      2.2%    1.9%
454.calculix      1.4s      5%      4%
456.hmmer         48.2s     0.2%    0.1%
458.sjeng         121.4s    0%      0.2%
459.GemsFDTD      55.2s     0.2%    1.6%
462.libquantum    1.8s      2%      0.6%
464.h264ref       87.2s     0%      0%
465.tonto         229.9s    0.8%    0.4%
470.lbm           31.9s     0%      0%
471.omnetpp       51.0s     0.7%    0.6%
473.astar         103.7s    0.2%    0%
481.wrf           161.6s    0.2%    0%
482.sphinx3       8.8s      3%      0%
483.xalancbmk     58.0s     1.2%    1.8%

Table 2: Quantifying run-time slowdown of CDE
package creation and execution within a package on the
SPEC CPU2006 benchmarks, using the “train” datasets.


7.3 Evaluating CDE run-time slowdown


The primary drawback of executing a CDE-packaged ap- 
plication is the run-time slowdown due to extra user- 
kernel context switches. Every time the target applica- 
tion issues a system call, the kernel makes two extra con- 
text switches to enter and then exit the cde-exec mon- 
itoring process, respectively. cde-exec performs some 
computations to calculate path redirections, but its run-time
overhead is dominated by context switching.*

We informally evaluated the run-time slowdown of 
cde and cde-exec on 34 diverse Linux applications. In 
summary, for CPU-bound applications, CDE causes al- 
most no slowdown, but for I/O-bound applications, CDE 
causes a slowdown of up to ~30%. 

* Disabling path redirection still results in similar overheads.

We first ran CDE on the entire SPEC CPU2006
benchmark suite (both integer and floating-point
benchmarks) [8]. We chose this suite because it contains
CPU-bound applications that are representative of the types
of programs that computational scientists and other
researchers are likely to run with CDE. For instance, SPEC
CPU2006 contains benchmarks for video compression,
molecular dynamics simulation, image ray-tracing,
combinatorial optimization, and speech recognition.

                     Native    CDE slowdown    Syscalls
Command              time      pack    exec    per sec
gadm (algorithm)     4187s     0%†     0%†     19
pads (inferencer)    18.6s     3%†     1%†     478
klee                 7.9s      31%     2%†     260
gadm (make plots)    12s       8%      2%†     544
gadm (C++ comp)      8.5s      17%     5%      1459
saturn               222.7s    18%     18%     6477
google-earth         12.5s     65%     19%     7938
pads (compiler)      1.7s      59%     28%     6969

Table 3: Quantifying run-time slowdown of CDE
package creation and execution within a package. Each
entry reports the mean taken over 5 runs; standard
deviations are negligible. Slowdowns marked with † are not
statistically significant at p < 0.01 according to a t-test.

We ran these experiments on a Dell machine with a 
2.67GHz Intel Xeon CPU running a 64-bit Ubuntu 10.04 
distro (Linux 2.6.32). Each trial was run three times, but 
the variances in running times were negligible. 

Table 2 shows the percentage slowdowns incurred 
by using cde to create each package (the ‘pack’ col- 
umn) and by using cde-exec to execute each package 
(the ‘exec’ column). The ‘exec’ column slowdowns are 
shown in bold since they are more important for our 
users: A package is only created once but executed mul- 
tiple times. In sum, slowdowns ranged from non-existent 
to ~4%, which is unsurprising since the SPEC CPU2006 
benchmarks were designed to be CPU-bound and not 
make much use of system calls. 

To test more realistic I/O-bound applications, we mea- 
sured running times for executing the following com- 
mands in the five CDE packages that we created (those 
labeled with “self” in the “Creator” column of Table 1): 


• pads — Compile a PADS [19] specification into C
  code (the “pads (compiler)” row in Table 3), and
  then infer a specification from a data file (the “pads
  (inferencer)” row in Table 3).

• gadm — Reproduce the GADM experiment [21]:
  Compile its C++ source code (‘C++ comp’), run
  genetic algorithm (‘algorithm’), and use the R statistics
  software to visualize output data (‘make plots’).

• google-earth — Measure startup time by
  launching it and then quitting as soon as the initial
  Earth image finishes rendering and stabilizes.

• klee — Use Klee [16] to symbolically execute a
  C target program (a STUN server) for 100,000
  instructions, which generates 21 test cases.

• saturn — Run the regression test suite, which
  contains 69 tests (each is a static program analysis).


We measured the following on a Dell desktop (2GHz 
Intel x86, 32-bit) running Ubuntu 8.04 (Linux 2.6.24): 
number of seconds it took to run the original command 
(‘Native time’), percent slowdown vs. native when run- 
ning a command with cde to create a package (‘pack’), 
and percent slowdown when executing the command 
from within a CDE package with cde-exec (‘exec’). We 
ran each benchmark five times under each condition and 
report mean running times. We used an independent two- 
group t-test [17] to determine whether each slowdown 
was statistically significant (i.e., whether the means of
two sets of runs differed by a non-trivial amount). 
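
As a rough illustration of this procedure, the sketch below computes a percent slowdown and a two-group t statistic for two sets of five timings. Several details are assumptions: the text does not say which variant of the t-test was used, so this uses the pooled-variance (Student's) form with a two-tailed threshold; the timing values are made up; and the function name is hypothetical. With five runs per group there are 8 degrees of freedom, for which the two-tailed critical value at p < 0.01 is about 3.355.

```python
import math
import statistics


def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled variance (mean of b minus mean of a)."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(b) - statistics.mean(a)) / std_err


# Two-tailed critical value for p < 0.01 with 5 + 5 - 2 = 8 degrees of freedom.
T_CRIT_DF8_P01 = 3.355

# Made-up running times in seconds, five runs per condition.
native = [10.0, 10.1, 9.9, 10.0, 10.0]
packaged = [10.5, 10.6, 10.4, 10.5, 10.5]

native_mean = statistics.mean(native)
slowdown_pct = 100 * (statistics.mean(packaged) - native_mean) / native_mean
t = pooled_t_statistic(native, packaged)
significant = abs(t) > T_CRIT_DF8_P01
print(f"slowdown {slowdown_pct:.1f}%, t = {t:.2f}, significant: {significant}")
```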

Table 3 shows that the more system calls a program 
issues per second, the more CDE causes it to slow down 
due to the extra context switches. Creating a CDE pack- 
age (‘pack’ column) is slower than executing a program 
within a package (‘exec’ column) because CDE must create
new sub-directories and copy files into the package.

CDE execution slowdowns ranged from negligible (not 
statistically significant) to ~30%, depending on system 
call frequency. As expected, CPU-bound workloads like 
the gadm genetic algorithm and the pads inferencer ma- 
chine learning algorithm had almost no slowdown, while 
those that were more I/O- and network-intensive (e.g., 
google-earth) had the largest slowdowns. 

When using CDE to run GUI applications, we did not 
notice any loss in interactivity due to the slowdowns. 
When we navigated around the 3D maps within the 
google-earth GUI, we felt that the CDE-packaged version
was just as responsive as the native version. When
we ran GUI programs from CDE packages that users sent 
to us (the bio-menace game, meld visual diff tool, and
vr-osg), we also did not perceive any visible lag. 

The main caveat of these experiments is that they are 
informal and meant to characterize “typical-case” behavior
rather than being stress tests of worst-case behavior.
One could imagine developing adversarial I/O intensive 
benchmarks that issue tens or hundreds of thousands of 
system calls per second, which would lead to greater 
slowdowns. We have not run such experiments yet. 

Finally, we also ran some informal performance tests 
of cde-exec’s seamless execution mode. As expected, 
there were no noticeable differences in running times 
versus regular cde-exec, since the context-switching 
overhead dominates cde-exec computation overhead. 
We know of no published system that automatically creates
portable software packages in situ from a live running
machine like CDE does. Existing tools for creating
self-contained applications all require the user to manu- 
ally specify dependencies at package creation time. For 
example, Mac OS X programmers can create application 
bundles using Apple’s developer tools IDE [6]. Research 
prototypes like PDS [14], which creates self-contained 
Windows apps, and the Collective [23], which aggregates 
a set of software into a portable virtual appliance, also 
require the user to manually specify dependencies.

VMware ThinApp is a commercial tool that automatically
creates self-contained portable Windows applications.
However, a user can only create a package by
having ThinApp monitor the installation of new soft- 
ware [12]. Unlike CDE, ThinApp cannot be used to cre- 
ate packages from existing software already installed on 
a live machine, which is our most common use case.

Package management systems are often used to install
open-source software and its dependencies. Generic
package managers exist for all major operating systems 
(e.g., RPM for Linux, MacPorts for Mac OS X, Cygwin 
for Windows), and specialized package managers exist
for ecosystems surrounding many programming languages
(e.g., CPAN for Perl, RubyGems for Ruby) [4].

From the package creator’s perspective, it takes time
and expertise to manually bundle up one’s software and 
list all dependencies so that it can be integrated into a 
specific package management system. A banal but tricky 
detail that package creators must worry about is adhering 
to platform-specific idioms for pathnames and avoiding 
hard-coding non-portable paths into their programs [25]. 
In contrast, creating a CDE package is as easy as running 
the target program, and hard-coded paths are fine since 
cde-exec redirects all file accesses into the package.

From the user’s perspective, package managers work
great as long as the exact desired versions of software 
exist within the system. However, version mismatches 
and conflicts are common frustrations, and installing new 
software can lead to a library upgrade that breaks existing 
software [18]. The Nix package manager is a research 
project that tries to eliminate dependency conflicts via 
stricter versioning, but it still requires package creators to 
manually specify dependencies at creation time [18]. In 
contrast, CDE packages can be run without any installation,
configuration, or risk of breaking existing software.

Virtual machine snapshots achieve CDE’s main goal
of capturing all dependencies required to execute a set of 
programs on another machine. However, they require the 
user to always be working within a VM from the start of 
a project (or else re-install all of their software within a 
new VM). Also, VM snapshot disk images are (by definition)
larger than the corresponding CDE packages since
they must also contain the OS kernel and other extrane- 
ous applications. CDE is a more lightweight solution be- 
cause it enables users to create and run packages natively 
on their own machines rather than through a VM. 


9 Discussion and conclusions 


Our design philosophy underlying CDE is that people 
should be able to package up their Linux software and 
deploy it to run on other Linux machines with as little ef- 
fort as possible. However, we are not proposing CDE as 
a replacement for traditional software installation. CDE 
packages have a number of limitations. Most notably, 


• They are not guaranteed to be complete.

• Their constituent shared libraries are “frozen” and
do not receive regular security updates. (Static linking
also shares this limitation.)

• They run slower than native applications due to
ptrace overhead. We measured slowdowns of
up to 28% in our informal experiments (§7.3), but
slowdowns can be worse for I/O-heavy programs.


Software engineers who are releasing production- 
quality software should obviously take the time to cre- 
ate and test one-click installers or integrate with package 
managers. But for the millions of system administra- 
tors, research scientists, prototype designers, program- 
ming course students and teachers, and hobby hackers 
who just want to deploy their ad-hoc software as quickly 
as possible, CDE can emulate many of the benefits of tra- 
ditional software distribution with much less required la- 
bor: In just minutes, users can create a base CDE pack- 
age by running their program under CDE supervision, use 
our semi-automated heuristic tools to make the package 
complete, deploy to the target Linux machine, and then 
execute it in seamless execution mode to make the target 
program behave like it was installed normally. 

In particular, we believe that the lightweight nature of 
CDE makes it a useful tool in the Linux system admin- 
istrator’s toolbox. Sysadmins need to rapidly and ef- 
fectively respond to emergencies, hack together scripts 
and other utilities on-demand, and run diagnostics with- 
out compromising the integrity of production machines. 
Ad-hoc scripts are notoriously brittle and non-portable 
across Linux distros due to differences in interpreter ver- 
sions (e.g., bash vs. dash shell, Python 2.x vs. 3.x), sys- 
tem libraries, and availability of the often-obscure pro- 
grams that the scripts invoke. Encapsulating scripts and 
their dependencies within a CDE package can make them 
portable across distros and minor kernel versions; we 
have been able to take CDE packages created on 2010- 
era Linux distros and run them on 2006-era distros [20]. 


Lessons learned: We would like to conclude by shar- 
ing some generalizable system design lessons that we 
learned throughout the past year of developing CDE. 


• First and foremost, start with a conceptually-clear
core idea, make it work for basic non-trivial cases, 
document the still-unimplemented tricky cases, 
launch your system, and then get feedback from real 
users. User feedback is by far the easiest way for 
you to discover what bugs are important to fix and 
what new features to add next. 


• A simple and appealing quick-start webpage guide
and screencast video demo are essential for attract- 
ing new users. No potential user is going to read 
through dozens of pages of an academic research 
paper before deciding to try your system. In short, 
even hackers need to learn to be great salespeople. 


• To maximize your system’s usefulness, you must
design it to be easy-to-use for beginners but also to 
allow advanced users to customize it to their liking. 
One way to accomplish this goal is to have well- 
designed default settings, which can be adjusted via 
command-line options or configuration files. The 
defaults must work well “out-of-the-box” without 
any tuning, or else beginners will get frustrated. 


• Resist the urge to add new features just because
they’re “interesting”, “cool”, or “potentially useful”.
Only add new features when there are compelling
real users who demand them. Instead, focus
your development efforts on fixing bugs, writing 
more test cases, improving your documentation, 
and, most importantly, attracting new users. 


• Users are the best sources of bug reports, since they
often stress your system in ways that you could have 
never imagined. Whenever a user reports a bug, try 
to create a representative minimal test case and add 
it to your regression test suite. 


• If a user has a conceptual misunderstanding of how
your system works, then think hard about how you 
can improve your documentation or default settings 
to eliminate this misunderstanding. 


In sum, get real users, make them happy, and have fun! 
Abstract


Managing many computers is difficult. Recent virtual- 
ization trends exacerbate this problem by making it easy 
to create and deploy multiple virtual appliances per phys- 
ical machine, each of which can be configured with dif- 
ferent applications and utilities. This results in a huge 
scaling problem for large organizations as management 
overhead grows linearly with the number of appliances. 

To address this problem, we introduce Strata, a system 
that combines unioning file system and package manage- 
ment semantics to enable more efficient creation, pro- 
visioning and management of virtual appliances. Un- 
like traditional systems that depend on monolithic file 
systems, Strata uses a collection of individual software
layers that are composed together into the Virtual Layered
File System (VLFS) to provide the traditional file
system view. Individual layers are maintained in a cen- 
tral repository and shared across all file systems that use 
them. Layer changes and upgrades only need to be done 
once in the repository and are then automatically propa- 
gated to all virtual appliances, resulting in management 
overhead independent of the number of appliances. Our 
Strata Linux prototype requires only a single loadable 
kernel module providing the VLFS support and doesn’t 
require any application or source code level kernel mod- 
ifications. Using this prototype, we demonstrate how 
Strata enables fast system provisioning, simplifies sys- 
tem maintenance and upgrades, speeds system recovery 
from security exploits, and incurs only modest perfor- 
mance overhead. 


1 Introduction 


A key problem organizations face is how to efficiently 
provision and maintain the large number of machines de- 
ployed throughout their organizations. This problem is 
exemplified by the growing adoption and use of virtual 
appliances (VAs). VAs are pre-built software bundles run 
inside virtual machines (VMs). Since VAs are often tai- 
lored to a specific application, these configurations can 
be smaller and simpler, potentially resulting in reduced 
resource requirements and more secure deployments. 


While VAs simplify application deployment and de- 
crease hardware costs, they can tremendously increase 
the human cost of administering these machines. As VAs
are cloned and modified, organizations that once had a 
few hardware machines to manage now find themselves 
juggling many more VAs with diverse system configura- 
tions and software installations. 

This causes many management problems. First, as 
these VAs share a lot of common data, they are inefficient 
to store, as there are multiple copies of many common 
files. Second, by increasing the number of systems in 
use, we increase the number of systems needing security 
updates. Finally, machine sprawl, especially of machines that
are not actively maintained, can give attackers many places to
hide as well as make attack detection more difficult. In- 
stead of a single actively used machine, administrators 
now have to monitor many irregularly used machines. 

Many approaches have been used to address these 
problems, including diskless clients [5], traditional pack- 
age management systems [6, 1], copy-on-write disks [9], 
deduplication [16] and new VM storage formats [12, 4]. 
Unfortunately, they suffer from various drawbacks that 
limit their utility and effectiveness in practice. They ei- 
ther do not directly help with management, incur man- 
agement overheads that grow linearly with the number of 
VAs, or require a homogeneous configuration, eliminating
the main advantages of VAs. 

The fundamental problem with previous approaches is 
that they are based on a monolithic file system or block 
device. These file systems and block devices address 
their data at the block layer and are simply used as a stor- 
age entity. They have no direct concept of what the file 
system contains or how it is modified. However, man- 
aging VAs is essentially done by making changes to the 
file system. As a result, any upgrade or maintenance op- 
eration needs to be done to each VA independently, even 
when they all need the same maintenance. 

We present Strata, a novel system that integrates file 
system unioning with package management semantics 
and uses the combination to solve VA management prob- 
lems. Strata makes VA creation and provisioning fast. 
It automates the regular maintenance and upgrades that 
must be performed on provisioned VA instances. Finally, 
it improves the ability to detect and recover from security
exploits. 

Strata achieves this by providing three architectural 
components: layers, layer repositories, and the Virtual 
Layered File System (VLFS). A layer is a set of files that 
are installed and upgraded as a unit. Layers are analo- 
gous to software packages in package management sys- 
tems. Like software packages, a layer may require other 
layers to function correctly, just as applications often re- 
quire various system libraries to run. Strata associates 
dependency information with each layer that defines re- 
lationships among distinct layers. Unlike software pack- 
ages, which are installed into each VA’s file system, lay- 
ers can be shared directly among multiple VAs. 

Layer repositories are used to store layers centrally 
within a virtualization infrastructure, enabling them to 
be shared among multiple VAs. Layers are updated and 
maintained in the layer repository. When a new version 
of an application becomes available, due to added fea- 
tures or a security patch, a new layer is added to the 
repository. Different versions of the same application 
may be available through different layers in the layer 
repository. The layer repository is typically stored in a 
shared storage infrastructure accessible by the VAs, such 
as a SAN. Storing layers on the SAN does not impact
VA performance because a SAN is where a traditional
VA’s monolithic file system is stored. 

The VLFS implements Strata’s unioning mechanism 
and provides the file system for each VA. Like a tradi- 
tional unioning file system, it is a collection of individual 
layers composed into a single view. It enables a file system
to be built out of many shared read-only layers while
providing each file system with its own private read-write 
layer to contain all file system modifications that occur 
during runtime. In addition, it provides new semantics 
that enable unioning file systems to be used as the basis
for a package-management-type system. These include
how layers get added and removed from the union struc- 
ture as well as how the file system handles files deleted 
from a read-only layer. 

Strata, by combining the unioning and package man- 
agement semantics, provides a number of management 
benefits. First, Strata is able to create and provision 
VAs quickly and easily. By leveraging each layer’s de- 
pendency information, Strata allows an administrator to 
quickly create template VAs by only needing to explicitly 
select the application and tool layers of interest. These 
template VAs can then be instantly provisioned by end
users: no copying or on-demand paging is needed to
instantiate a file system, because all the layers are accessed
from the shared layer repository.

Second, Strata automates upgrades and maintenance 
of provisioned VAs. If a layer contains a bug to be fixed, 
the administrator only updates the template VA with a 
replacement layer containing the fix. This automatically
informs all provisioned VAs to incorporate the updated 
layer into their VLFS’s namespace view, thereby requir- 
ing the fix to only be done once no matter how many 
VAs are deployed. Unlike traditional VAs, which are updated
by replacing an entire file system [12, 4], Strata VAs
do not need to be rebooted for these changes to take
effect. Unlike package management, all VLFS changes
are atomic as no time is spent deleting and copying files. 

Finally, this semantic allows Strata to easily recover 
VAs in the presence of security exploits. The VLFS al- 
lows Strata to distinguish between files installed via its 
package manager, which are stored in a shared read-only 
layer, and the changes made over time, which are stored 
in the private read-write layer. If a VA is compromised, 
the modifications will be confined to the VLFS’s pri- 
vate read-write layer, thereby making the changes easy 
to both identify and remove. 

We have implemented a Strata Linux prototype without
any changes to applications or to operating system kernel
source code, and provide the VLFS as a loadable kernel
module. We show that by combining traditional package
management with file system unioning we provide
powerful new functionality that can help automate many 
machine management tasks. We have used our proto- 
type with VMware ESX virtualization infrastructure to 
create and manipulate a variety of desktop and server 
VAs to demonstrate its utility for system provisioning, 
system maintenance and upgrades, and system recovery. 
Our experimental results show that Strata can provision 
VAs in only a few seconds, can upgrade a farm of fifty 
VAs with several different configurations in less than two 
minutes, and has scalable storage requirements and mod- 
est file system performance overhead. 


2 Related Work 


The most common way to provision and maintain ma- 
chines today is using the package management system 
built into the operating system [6, 1]. Package manage- 
ment provides a number of benefits. First, it divides the 
installable software into independent chunks called pack- 
ages. When one wants to install a piece of software or 
upgrade an already installed piece of software, all one 
has to do is download and install that single item. Sec- 
ond, these packages can include dependency information 
that instructs the system about what other packages must 
be installed with this package. This enables tools [2, 10] 
to automatically determine the entire set of packages one 
needs to install when one wants to install a piece of software,
making it significantly easier for an end-user to
install software.

However, package managers view the file system as a 
simple container for files and not as a partner in the management
of the machine. This causes them to suffer from
a number of flaws in their management of large numbers 
of VAs. They are not space or time efficient, as each pro- 
visioned VA requires time-consuming copying of many 
megabytes or gigabytes into each VA’s file system. These 
inefficiencies affect both provisioning and updating of a 
system, as a lot of time is spent downloading, extracting,
and installing the individual packages into the many
independent VAs. 

As the package manager does not work in partnership 
with the file system, the file system does not distinguish 
between a file installed from a package and a file modi- 
fied or created in the course of usage. Specialized tools 
are needed to traverse the entire file system to determine 
if a file has been modified and therefore compromised. 
Finally, package management systems work in the context
of a running system to modify the file system directly.
These tools often cannot work if the VA is
suspended or turned off. 

For local scenarios, the size and time efficiencies of 
provisioning a VA can be improved by utilizing copy-on-write
(COW) disks, such as QEMU’s QCOW2 [9]
format. These enable VAs to be provisioned quickly,
as little data has to be written to disk immediately due 
to the COW property. However, once provisioned, each 
COW copy is now fully independent from the original, is 
equivalent to a regular copy, and therefore suffers from 
all the same maintenance problems as a regular VA. Even 
if the original disk image is updated, the changes would 
be incompatible with the cloned COW images. This is 
because COW disks operate at the block level. As files 
get modified, they use different blocks on their underly- 
ing device. Therefore, it is likely that the original and 
cloned COW images address the same blocks for differ- 
ent pieces of data. For similar reasons, COW disks do not 
help with VA creation, as multiple COW disks cannot be 
combined together into a single disk image. 

Both the Collective [4] and Ventana [12] attempt to 
solve the VA maintenance problem by building upon 
COW concepts. Both systems enable VAs to be provi- 
sioned quickly by performing a COW copy of each VA’s 
system file system. However, they suffer from the fact 
that they manage this file system at either the block de- 
vice or monolithic file system level, providing users with 
only a single file system. While ideally an administra- 
tor could supply a single homogeneous shared image for 
all users, in practice, users want access to many heteroge- 
neous images that must be maintained independently and 
therefore increase the administrator’s work. The same 
is true for VAs provisioned by the end user, while they 
both enable the VAs to maintain a separate disk from the 
shared system disk that persists beyond upgrades. 

Mirage [17] attempts to improve the disk image sprawl 
problem by introducing a new storage format, the Mirage
Index Format (MIF), to enumerate what files belong
to a package. However, it does not help with the
actual image sprawl in regard to machine maintenance, 
because each machine reconstituted by Mirage still has a 
fully independent file system, as each image has its own 
personal copy. Although each provisioned machine can 
be tracked, they are now independent entities and suffer 
from the same problems as a traditional VA. 

Stork [3] improves on package management for 
container-based systems by enabling containers to hard 
link to an underlying shared file system so that files are 
only stored once across all containers. By design, it can- 
not help with managing independent machines, virtual 
machines, or VAs, because hard links are a function in- 
ternal to a specific file system and not usable between 
separate file systems. 

Union file systems [11, 19] provide the ability to com- 
pose multiple different file namespaces into a single 
view. Unioning file systems are commonly used to pro- 
vide a COW file system from a read-only copy, such as 
with Live-CDs. However, unioning file systems by themselves
do not directly help with VA management, as the
underlying file system has to be maintained using regular 
tools. Strata builds upon and leverages this mechanism 
by improving its ability to handle deleted files as well 
as managing the layers that belong to the union. This 
allows Strata to provide a solution that enables efficient 
provisioning and management of VAs. 

Strata focuses on improving virtual appliance manage- 
ment, but the VLFS idea can be used to address other 
management and security problems as well. For exam- 
ple, our previous work on Apiary [14] demonstrates how 
the VLFS can be combined with containers to provide 
a transparent desktop application fault containment ar- 
chitecture that is effective at limiting the damage from 
exploits to enable quick recovery while being as easy to 
use as a traditional desktop system. 


3 Strata Basics 


Figure 1 shows Strata’s three architectural components:
layers, layer repositories, and VLFSs. A layer is a dis- 
tinct self-contained set of files that corresponds to a spe- 
cific functionality. Strata classifies layers into three cat- 
egories: software layers with self-contained applications 
and system libraries, configuration layers with configu- 
ration file changes for a specific VA, and private layers 
allowing each provisioned VA to be independent. Lay- 
ers can be mixed and matched, and may depend on other 
layers. For example, a single application or system li- 
brary is not fully independent, but depends on the pres- 
ence of other layers, such as those that provide needed 
shared libraries. Strata enables layers to enumerate their 
dependencies on other layers. This dependency scheme
allows automatic provisioning of a complete, fully consistent
file system by selecting the main features desired
within the file system.

Figure 1: How Strata’s Components Fit Together

Layers are provided through layer repositories. As 
Figure 1 shows, a layer repository is a file system share 
containing a set of layers made available to VAs. When 
an update is available, the old layer is not overwritten. 
Instead, a new version of the layer is created and placed 
within the repository, making it available to Strata’s 
users. Administrators can also remove layers from the 
repository, e.g., those with known security holes, to pre- 
vent them from being used. Layer repositories are gen- 
erally stored on centrally managed file systems, such as 
a SAN or NFS, but they can also be provided by protocols
such as FTP and HTTP and mirrored locally. Layers
from multiple layer repositories can form a VLFS as long 
as they are compatible with one another. This allows lay- 
ers to be provided in a distributed manner. Layers pro- 
vided by different maintainers can have the same layer 
names, causing a conflict. This, however, is no different 
from traditional package management systems as pack- 
ages with the same package name, but different function- 
ality, can be provided by different package repositories. 

As Figure 1 shows, a VLFS is a collection of layers 
from layer repositories that are composed into a single 
file system namespace. The layers making up a particu- 
lar VLFS are defined by the VLFS’s layer definition file 
(LDF), which enumerates all the layers that will be com- 
posed into a single VLFS instance. To provision a VLFS, 
an administrator selects software layers that provide the 
desired functionality and lists them in the VLFS’s LDF. 
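
The way dependency information can expand a short selection into a complete layer list might be sketched as follows. This is a minimal illustration, not Strata's implementation: the repository contents, the layer names other than mysql-server, and the plain list standing in for an LDF are all assumptions, and layer versions are ignored.

```python
from collections import deque

# Hypothetical repository metadata: layer name -> layers it depends on.
REPOSITORY = {
    "mysql-server": ["libc", "libssl"],
    "apache": ["libc", "libssl"],
    "libssl": ["libc"],
    "libc": [],
}


def resolve_layers(selected, repository):
    """Return the selected layers plus all transitive dependencies.

    Layers are listed once each, in the order they are discovered.
    """
    resolved = []
    seen = set()
    queue = deque(selected)
    while queue:
        name = queue.popleft()
        if name in seen:
            continue
        if name not in repository:
            raise KeyError(f"layer {name!r} is not in the repository")
        seen.add(name)
        resolved.append(name)
        queue.extend(repository[name])
    return resolved


# The administrator names only the layer of interest; the rest is derived.
ldf = resolve_layers(["mysql-server"], REPOSITORY)
print("\n".join(ldf))
```

The administrator's explicit choice stays small (one layer here), while the resulting list names every layer the file system needs.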

Within a VLFS, layers are stacked on top of one another
and composed into a single file system view. An impli- 
cation of this composition mechanism is that layers on 
top can obscure files on layers below them, only allow- 
ing the contents of the file instance contained within the 
higher level to be used. This means that files in the private
or configuration layers can obscure files in lower
layers, such as when one makes a change to a default 
version of a configuration file located within a software 
layer. However, to prevent an ambiguous situation from 
occurring, where the file system’s contents depend on the 
order of the software layers, Strata prevents software layers
that contain any of the same files from being composed
into a single VLFS.
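
The two rules in this paragraph, top-down lookup and rejection of overlapping software layers, can be modeled in a few lines. The sketch below represents each layer as a dictionary from path to contents, which is purely an illustrative assumption. The file paths and the mysql-alt layer are hypothetical, and the real VLFS is a kernel-level file system rather than a dictionary lookup.

```python
def find_conflicts(software_layers):
    """Map each path provided by more than one software layer to those layers."""
    owners = {}
    for layer_name, files in software_layers.items():
        for path in files:
            owners.setdefault(path, []).append(layer_name)
    return {path: names for path, names in owners.items() if len(names) > 1}


def lookup(path, stack):
    """Return the contents of path from the topmost layer that holds it."""
    for layer in stack:  # stack is ordered from top (private) to bottom
        if path in layer:
            return layer[path]
    raise FileNotFoundError(path)


software = {
    "mysql-server": {
        "/etc/mysql/my.cnf": "default settings",
        "/usr/sbin/mysqld": "binary",
    },
    "libc": {"/lib/libc.so.6": "library"},
}
configuration = {"/etc/mysql/my.cnf": "site settings"}
private = {}

# Software layers must not overlap, so their relative order does not matter.
conflicts = find_conflicts(software)

# Private and configuration layers sit on top and may obscure lower files.
stack = [private, configuration] + list(software.values())
print(lookup("/etc/mysql/my.cnf", stack))

# A second layer shipping the same file would be rejected.
clashing = {**software, "mysql-alt": {"/usr/sbin/mysqld": "other binary"}}
print(find_conflicts(clashing))
```

Because overlap is only permitted between a software layer and the configuration or private layers above it, the result of a lookup never depends on how the software layers happen to be ordered.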


4 Using Strata 


Strata’s usage model is centered around the usage of lay- 
ers to quickly create VLFSs for VAs as shown in Fig- 
ure 1. Strata allows an administrator to compose together 
layers to form template VAs. These template VAs can be 
used to form other template appliances that extend their 
functionality, as well as to provide the VA that end users 
will provision and use. Strata is designed to be used 
within the same setup as a traditional VM architecture. 
This architecture includes a cluster of physical machines 
that are used to host VM execution as well as a shared 
SAN that stores all of the VM images. However, instead 
of storing complete disk images on the SAN, Strata uses 
the SAN to store the layers that will be used by the VMs 
it manages. 


4.1 Creating Layers and Repositories 


Layers are first created and stored in layer repositories. 
Layer creation is similar to the creation of packages in 
a traditional package management system, where one 
builds the software, installs it into a private directory, 
and turns that directory into a package archive, or in 
Strata’s case, a layer. For instance, to create a layer 
that contains the MySQL SQL server, the layer main- 
tainer would download the source archive for MySQL, 
extract it, and build it normally. However, instead of in- 
stalling it into the system’s root directory, one installs 
it into a virtual root directory that becomes the file sys- 
tem component of this new layer. The layer maintainer 
then defines the layer’s metadata, including its name 
(mysql-server in this case) and an appropriate version
number to uniquely identify this layer. Finally, the
entire directory structure of the layer is copied into a file 
system share that provides a layer repository, making the 
layer available to users of that repository. 
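
The final publishing step could look roughly like the sketch below, which copies a staged virtual root into a repository directory and records the layer's name and version. The on-disk layout (a `files` subdirectory plus `metadata.json`), the version string, and the function name are assumptions made for illustration. The text does not describe Strata's actual layer format.

```python
import json
import shutil
import tempfile
from pathlib import Path


def publish_layer(virtual_root, repository, name, version):
    """Copy a staged virtual root into the repository as layer name-version."""
    layer_dir = Path(repository) / f"{name}-{version}"
    if layer_dir.exists():
        # An existing layer is never overwritten; a new version gets its own directory.
        raise FileExistsError(layer_dir)
    shutil.copytree(virtual_root, layer_dir / "files")
    metadata = {"name": name, "version": version}
    (layer_dir / "metadata.json").write_text(json.dumps(metadata))
    return layer_dir


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the result of installing into a virtual root directory.
    staged = Path(tmp) / "staging"
    (staged / "usr" / "sbin").mkdir(parents=True)
    (staged / "usr" / "sbin" / "mysqld").write_text("placeholder binary")

    repo = Path(tmp) / "repository"
    repo.mkdir()

    layer = publish_layer(staged, repo, "mysql-server", "5.1")
    published = sorted(
        p.relative_to(layer).as_posix() for p in layer.rglob("*") if p.is_file()
    )

print(published)
```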


4.2 Creating Appliance Templates 


Given a layer repository, an administrator can then cre- 
ate template VAs. Creating a template VA involves: (1) 
Creating the template VA with an identifiable name. (2) 
Determining what repositories are available to it. (3) Selecting
a set of layers that provide the functionality desired.

To create a template VA that provides a MySQL 
SQL server, an administrator creates an appliance/VLFS 
named sql-server and selects the layers needed for a
fully functional MySQL server file system, most impor- 
tantly, the mysql-server layer. Strata composes these lay- 
ers together into the VLFS in a read-only manner along 
with a read-write private layer, making the VLFS us- 
able within a VM. The administrator boots the VM and 
makes the appropriate configuration changes to the tem- 
plate VA, storing them within the VLFS’s private layer. 
Finally, the private layer belonging to the template appli- 
ance’s VLFS is converted into the template’s read-only 
configuration layer by being moved to a SAN file-system 
that the VAs can only access in a read-only manner. As 
another example, to create an Apache web server appli- 
ance, an administrator creates an appliance/VLFS named 
web-server, and selects the layers required for an 
Apache web server, most importantly, the layer contain- 
ing the Apache program. 

Strata extends this template model by allowing multi- 
ple template VAs to be composed together into a single 
new template. An administrator can create a new tem- 
plate VA/VLFS, sql+web-server, composed of the 
MySQL and Apache template VAs. The resulting VLFS 
has the combined set of software layers from both tem- 
plates, both of their configuration layers, and a new con- 
figuration layer containing the configuration state that in- 
tegrates the two services together, for a total of three con- 
figuration layers. 


4.3 Provisioning and Running Appliance Instances


In Strata, a VLFS can be created by building off a pre-
viously defined VLFS’s set of layers and combining those
layers with a new read-write private layer. Therefore, 
given previously defined templates, Strata enables VAs 
to be efficiently and quickly provisioned and deployed 
by end users. Provisioning a VA involves (1) creating 
a virtual machine container with a network adapter and 
an empty virtual disk, (2) using the network adapter’s 
unique MAC address as the machine’s identifier for iden- 
tifying the VLFS created for this machine, and (3) form- 
ing the VLFS by referencing the already existing respec- 
tive template VLFS and combining the template’s read- 
only software and configuration layers with a read-write 
private layer provided by the VM’s virtual disk. 

As each VM managed by Strata does not have a phys- 
ical disk off which to boot, Strata network boots each 
VM. When the VM boots, its BIOS discovers a network 
boot server which provides it with a boot image, includ-
ing a base Strata environment. The VM boots this base
environment, which then determines which VLFS should 
be mounted for the provisioned VM using the MAC ad- 
dress of the machine. Once the proper VLFS is mounted, 
the machine transitions to using it as its root file system. 


4.4 Updating Appliances 


Strata upgrades provisioned VAs efficiently using a sim- 
ple three-step process. First, an updated layer is installed 
into a shared layer repository. Second, administrators are 
able to modify the template appliances under their con- 
trol to incorporate the update. Finally, all provisioned 
VAs based on that template will automatically incorpo- 
rate the update as well. Note that updating appliances 
is much simpler than updating generic machines, as ap- 
pliances are not independently managed machines. This 
means that extra software that can conflict with an up- 
grade will not be installed into a centrally managed ap- 
pliance. Centrally managed appliance updates are lim- 
ited to changes to their configuration files and what data 
files they store. 

Strata’s updates propagate automatically even if the 
VA is not currently running. If a provisioned VA is shut 
down, the VA will compose whatever updates have been 
applied to its templates automatically, never leaving the 
file system in a vulnerable state, because it composes its 
file system afresh each time it boots. If it is suspended, 
Strata delays the update to when the VA is resumed, as 
updating layers is a quick task. Updating is significantly 
quicker than resuming, so this does not add much to its 
cost. 

Furthermore, VAs are upgraded atomically, as Strata 
adds and removes all the changed layers in a single oper-
ation. In contrast, traditional package management sys-
tems, when upgrading a package, first uninstall it before
reinstalling the newer version. This traditional method 
leaves the file system in an inconsistent state for a short 
period of time. For instance, when the libc package is up- 
graded, its contents are first removed from the file system 
before being replaced. Any application that tries to exe- 
cute during the interim will fail to dynamically link be- 
cause the main library on which it depends is not present 
within the file system at that moment. 
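
The single-operation switch described above can be illustrated with a small sketch. It assumes, purely for illustration, that a VLFS's layer list is stored in one definition file and that swapping that file is the whole update; the function name replace_layer_list and the file layout are hypothetical and are not taken from Strata's implementation. The new list is written to a temporary file and then renamed over the old one, so a reader sees either the complete old list or the complete new list, never a mixture.

```python
import os
import tempfile


def replace_layer_list(ldf_path, new_lines):
    """Atomically replace the layer list stored at ldf_path (hypothetical layout)."""
    directory = os.path.dirname(os.path.abspath(ldf_path))
    # Write the complete new list to a temporary file in the same directory,
    # so that the final rename stays within one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write("\n".join(new_lines) + "\n")
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace renames over the destination; on POSIX this is atomic.
        os.replace(tmp_path, ldf_path)
    except BaseException:
        # Something failed before the rename completed: discard the temp file.
        os.unlink(tmp_path)
        raise
```

Editing the list in place, by contrast, would expose an intermediate state comparable to the missing libc in the example above.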


4.5 Improving Security 


Strata makes it much easier to manage VAs that have had 
their security compromised. By dividing a file system 
into a set of shared read-only layers and storing all file 
system modifications inside the private read-write layer, 
Strata separates changes made to the file system via layer 
management from regular runtime modifications. This 
enables Strata to easily determine when system files have 
been compromised. Because making a compromise per-
sistent requires modifying the file system, any files mod-
ified or added to create the compromise will be readily
visible in the private layer. This allows Strata
to not rely on tools like Tripwire [8] or maintain sepa- 
rate databases to determine if files have been modified 
from their installed state. Similarly, this check can be 
run external to the VA, as it just needs access to the pri- 
vate layer, thereby preventing an attacker from disabling 
it. This reduces management load, because no external
database has to be kept in sync with the file sys-
tem state as it changes. While an attacker could try to
compromise files on the shared layers, they would have 
to exploit the SAN containing the layer repository. In 
a regular virtualization architecture, if an attacker could 
exploit the SAN, he would also have access to all 

This segregation of modified file system state also en- 
ables quick recovery from a compromised system. By 
replacing the VA’s private layer with a fresh private
layer, the compromised system is immediately fixed and 
returned to its default, freshly provisioned state. How- 
ever, unlike reinstalling a system from scratch, replacing 
the private layer does not require throwing away the con- 
tents of the old private layer. Strata enables the layer 
to be mounted within the file system, enabling admin- 
istrators to have easy access to the files located within 
it to move the uncompromised files back to their proper 
place. 
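
The external check described above can be sketched as a walk over the private layer. The sketch assumes that each layer is reachable as an ordinary directory tree; the function name classify_private_files and its arguments are hypothetical, and white-out entries are not modeled. Files in the private layer that shadow a path in some shared layer are reported as modified system files, and all other files as additions.

```python
import os


def classify_private_files(private_root, shared_roots):
    """Split private-layer files into (modified, added) relative paths.

    'modified' files shadow a file that also exists in a shared layer;
    'added' files exist only in the private layer.
    """
    modified, added = [], []
    for dirpath, _dirnames, filenames in os.walk(private_root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), private_root)
            if any(os.path.lexists(os.path.join(root, rel)) for root in shared_roots):
                modified.append(rel)
            else:
                added.append(rel)
    return sorted(modified), sorted(added)
```

Because the function only reads directory trees, it can run on a host that mounts the private layer, outside the VA being inspected.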


5 Strata Architecture 


Strata introduces the concept of a virtual layered file 
system in place of traditional monolithic file systems. 
Strata’s VLFS allows file systems to be created by com- 
posing layers together into a single file system names- 
pace view. Strata allows these layers to be shared by 
multiple VLFSs in a read-only manner or to remain read- 
write and private to a single VLFS. 

Every VLFS is defined by a layer definition file, which 
specifies what software layers should be composed to- 
gether. An LDF is a simple text file that lists the layers 
and their respective repositories. The LDF’s layer list 
syntax is repository/layer version and can be 
preceded by an optional modifier command. When an
administrator wants to add or remove software from the 
file system, instead of modifying the file system directly, 
they modify the LDF by adding or removing the appro- 
priate layers. 

Figure 2 contains an example LDF for a MySQL SQL 
server template appliance. The LDF lists each individual 
layer included in the VLFS along with its correspond-
ing repository. Each layer also has a number indicating
which version will be composed into the file system. If 
an updated layer is made available, the LDF is updated 

main/mysql-server 5.0.51a-3

main/base 1
main/libdb4.2 4.2.52-16
main/apt-utils 0.5.28.6
main/liblocale-gettext-perl 1.01-17
main/libtext-charwidth-perl 0.04-1
main/libtext-iconv-perl 1.2-3
main/libtext-wrapi18n-perl 0.06-1
main/debconf 1.4.30.13
main/tcpd 7.6-8
main/libgdbm3 1.8.3-2
main/perl 5.8.4-8
main/psmisc 21.5-1
main/libssl0.9.7 0.9.7e-5
main/liblockfile1 1.06
main/adduser 3.63
main/libreadline4 4.3-11
main/libnet-daemon-perl 0.38-1
main/libplrpc-perl 0.2017-1
main/libdbi-perl 1.46-6
main/ssmtp 2.61-2
=main/mailx 3a8.1.2-0.20040524cvs-4


Figure 2: LDF for MySQL Server Template 


to include the new layer version instead of the old one. 
If the administrator of the VLFS does not want to up- 
date the layer, they can hold a layer at a specific version, 
with the = syntax element. This is demonstrated by the 
mailx layer in Figure 2, which is being held at the ver- 
sion listed in the LDF. 

Strata allows an administrator to select explicitly only 
the few layers corresponding to the exact functionality 
desired within the file system. Other layers needed in 
the file system are implicitly selected by the layers’ de- 
pendencies as described in Section 5.2. Figure 2 shows 
how Strata distinguishes between explicitly and implic- 
itly selected layers. Explicitly selected layers are listed 
first and separated from the implicitly selected layers 
by a blank line. In this case, the MySQL server has 
only one explicit layer, mysql-server, but has 21 implic- 
itly selected layers. These include utilities such as Perl 
and TCP Wrappers (tcpd), as well as libraries such as 
OpenSSL (libssl). Like most operating systems that re- 
quire a minimal set of packages to always be installed, 
Strata also always includes a minimal set of shared layers 
that are common to all VLFSs that it denotes as base. In 
our Strata prototype, these are the layers that correspond 
to packages that Debian makes essential and are there- 
fore not removable. Strata also distinguishes explicit lay- 
ers from implicit layers to allow future reconfigurations 
to remove one implicit layer in favor of another if depen- 
dencies need to change. 
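
The LDF conventions described so far (one repository/layer version entry per line, an optional = prefix to hold a version, and a blank line separating explicit from implicit layers) can be captured in a short parser. This is a minimal sketch rather than Strata's own code; the LayerEntry type and the parse_ldf function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class LayerEntry:
    repository: str
    name: str
    version: str
    held: bool      # True when the line carries the "=" modifier
    explicit: bool  # True for entries listed before the first blank line


def parse_ldf(text):
    """Parse LDF text whose lines look like '[=]repository/layer version'."""
    entries = []
    explicit = True
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line after at least one entry ends the explicit section.
            if entries:
                explicit = False
            continue
        held = line.startswith("=")
        if held:
            line = line[1:]
        target, version = line.split()
        repository, name = target.split("/", 1)
        entries.append(LayerEntry(repository, name, version, held, explicit))
    return entries
```

An update tool built on such a parser would skip every entry whose held flag is set when rewriting version numbers.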

When an end user provisions an appliance by cloning a 
template, an LDF is created for the provisioned VA. Fig- 
@main/sql-server


Figure 3: LDF for Provisioned MySQL Server VA 


ure 3 shows an example introducing another syntax ele- 
ment, @, that instructs Strata to reference another VLFS’s 
LDF as the basis for this VLFS. This lets Strata clone the 
referenced VLFS by including its layers within the new 
VLFS. In this case, because the user wants only to de-
ploy the SQL server template, this VLFS LDF only has
to include the single @ line. In general, a VLFS can refer- 
ence more than one VLFS template, assuming that layer 
dependencies allow all the layers to coexist. 
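
Following @ references amounts to a recursive expansion of layer lists. The sketch below assumes an in-memory dictionary standing in for the definition files on disk; the function name expand_ldf, the dictionary layout, and the cycle check are illustrative assumptions, not documented Strata behavior. Duplicate lines contributed by several templates are kept only once.

```python
def expand_ldf(name, ldfs, _path=()):
    """Return the layer lines of VLFS `name`, following '@' references.

    `ldfs` maps a VLFS name such as 'main/sql-server' to its list of LDF
    lines (a hypothetical stand-in for the definition files on disk).
    """
    if name in _path:
        raise ValueError(f"circular template reference: {name}")
    layers = []
    for raw in ldfs[name]:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("@"):
            # Recurse into the referenced template, remembering the chain
            # of references so that cycles are detected.
            new = expand_ldf(line[1:], ldfs, _path + (name,))
        else:
            new = [line]
        for layer in new:
            if layer not in layers:
                layers.append(layer)
    return layers
```

The sketch only merges lists. Whether the merged layers can actually coexist is a separate question that the dependency metadata has to answer.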


5.1 Layers 


Strata’s layers are composed of three components: meta- 
data files, the layer’s file system, and configuration 
scripts. They are stored on disk as a directory tree 
named by the layer’s name and version. For instance, 
version 5.0.51a of the MySQL server, with a Strata
layer version of 3, would be stored under the directory
mysql-server_5.0.51a-3. Within this directory,
Strata defines a metadata file, a filesystem di- 
rectory, and a scripts directory corresponding to the 
layer’s three components. 

The metadata files define the information that de- 
scribes the layer. This includes its name, version, and 
dependency information. This information is impor- 
tant to ensure that a VLFS is composed correctly. The 
metadata file contains all the metadata that is speci- 
fied for the layer. Figure 4 shows an example metadata 
file. Figure 5 shows the full metadata syntax. The meta- 
data file has a single field per line with two elements, the 
field type and the field contents. In general, the metadata 
file’s syntax is Field Type: value, where value 
can be either a single entry or a comma-separated list of 
values. 
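
A parser for this one-field-per-line format is short. The sketch below is illustrative only: the function name parse_metadata is hypothetical, and it assumes that a long value may continue on following lines that begin with whitespace, a convention borrowed from Debian control files rather than stated in the text.

```python
def parse_metadata(text):
    """Parse 'Field: value' lines; list-valued fields are split on commas."""
    fields = {}
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            continue
        if raw[0].isspace() and current is not None:
            # Assumed convention: indented lines extend the previous field.
            fields[current] += " " + raw.strip()
        else:
            key, _, value = raw.partition(":")
            current = key.strip()
            fields[current] = value.strip()
    # Fields that hold comma-separated lists rather than a single entry.
    for key in ("Depends", "Pre-Depends", "Conflicts", "Provides"):
        if key in fields:
            fields[key] = [item.strip() for item in fields[key].split(",") if item.strip()]
    return fields
```

Single-valued fields such as Layer and Version come back as strings, while the four dependency fields come back as lists.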

The layer’s file system is a self-contained set of files 
providing a specific functionality. The files are the indi- 
vidual items in the layer that are composed into a larger 
VLFS. There are no restrictions on the types of files that
can be included. They can be regular files, symbolic 
links, hard links, or device nodes. Similarly, each di- 
rectory entry can be given whatever permissions are ap- 
propriate. A layer can be seen as a directory stored on 
the shared file system that contains the same file and di- 
rectory structure that would be created if the individual 
items were installed into a traditional file system. On a 
traditional UNIX system, the directory structure would 
typically contain directories such as /usr, /bin and 
/etc. Symbolic links work as expected between layers 
since they work on path names, but one limitation is that 
hard links cannot exist between layers. 

The layer’s configuration scripts are run when a layer 


Layer: mysql-server
Version: 5.0.51a-3
Depends: ..., perl (>= 5.6),
         tcpd (>= 7.6-4), ...


Figure 4: Metadata for MySQL-Server Layer 


Layer: Layer Name
Version: Version of Layer Unit
Conflicts: layer1 (opt. constraint),
Depends: layer1 (...),
         layer2 (...) | layer3,
Pre-Depends: layer1 (...),
Provides: virtual_layer,


Figure 5: Metadata Specification 


is added or removed from a VLFS to allow proper in- 
tegration of the layer within the VLFS. Although many 
layers are just a collection of files, other layers need to 
be integrated into the system as a whole. For example, 
a layer that provides mp3 file playing capability should 
register itself with the system’s MIME database to allow 
programs contained within the layer to be launched au- 
tomatically when a user wants to play an mp3 file. Simi- 
larly, if the layer were removed, it should remove the pro- 
grams contained within itself from the MIME database. 
Strata supports four types of configuration scripts: pre- 
remove, post-remove, pre-install, and post-install. If they 
exist in a layer, the appropriate script is run before or 
after a layer is added or removed. For example, a pre- 
remove script can be used to shut down a daemon before 
it is actually removed, while a post-remove script can 
be used to clean up file system modifications in the pri- 
vate layer. Similarly, a pre-install script can ensure that 
the file system is as the layer expects, while the post- 
install script can start daemons included in the layer. The 
configuration scripts can be written in any scripting lan- 
guage. The layer must include the proper dependencies 
to ensure that the scripting infrastructure is composed 
into the file system in order to allow the scripts to run. 


5.2 Dependencies 


A key Strata metadata element is enumeration of the de- 
pendencies that exist between layers. Strata’s depen- 
dency scheme is heavily influenced by the dependency 
scheme in Linux distributions such as Debian and Red 
Hat. In Strata, every layer composed into Strata’s VLFS 
is termed a layer unit. Every layer unit is defined by its 
name and version. Two layer units that have the same 
name but different layer versions are different units of 
the same layer. A layer refers to the set of layer units
of a particular name. Every layer unit in Strata has a 
set of dependency constraints placed within its metadata. 
There are four types of dependency constraints: (a) de-
pendency, (b) pre-dependency, (c) conflict and (d) pro-
vide.

Dependency and Pre-Dependency: Dependency and 
pre-dependency constraints are similar in that they re- 
quire another layer unit to be integrated at the same 
time as the layer unit that specifies them. They differ 
only in the order the layer’s configuration scripts are ex- 
ecuted to integrate them into the VLFS. A regular de- 
pendency does not dictate order of integration. A pre- 
dependency dictates that the dependency has to be inte- 
grated before the dependent layer. Figure 4 shows that 
the MySQL layer depends on TCP Wrappers (tcpd),
because it dynamically links against the shared library 
libwrap.so.0 provided by TCP Wrappers. MySQL 
cannot run without this shared library, so the layer units 
that contain MySQL must depend on a layer unit contain- 
ing an appropriate version of the shared library. These 
constraints can also be versioned to further restrict which 
layer units satisfy the constraint. For example, shared 
libraries can add functionality that breaks their applica- 
tion binary interface (ABI), breaking in turn any applica- 
tions that depend on that ABI. Since MySQL is compiled 
against version 0.7.6 of the libwrap library, the depen- 
dency constraint is versioned to ensure that a compatible 
version of the library is integrated at the same time. 

Conflict: Conflict constraints indicate that layer units 
cannot be integrated into the same VLFS. There are mul-
tiple reasons this can occur, but it is generally because
they depend on exclusive access to the same operating 
system resource. This can be a TCP port in the case of 
an Internet daemon, or two layer units that contain the 
same file pathnames and therefore would obscure each 
other. For this reason, Strata defines that two layer units 
of the same layer are by definition in conflict because 
they will contain some of the same files. 

An example of this constraint occurs when the ABI 
of a shared library changes without any source code 
changes, generally due to an ABI change in the tool 
chain that builds the shared library. Because the ABI 
has changed, the new version can no longer satisfy any 
of the previous dependencies. But because nothing else 
has changed, the file on disk will usually not be renamed 
either. A new layer must then be created with a different 
name, ensuring that the library with the new ABI is never 
used to satisfy an old dependency on the original layer. 
Because the new layer contains the same files as the old 
layer, it must conflict with the older layer to ensure that 
they are not integrated into the same file system. 

Provide: Provide dependency constraints introduce 
virtual layers. A regular layer provides a specific set of 
files, but a virtual layer indicates that a layer provides 
a particular piece of general functionality. Layer units 
that depend on a certain piece of general functionality 
can depend on a specific virtual layer name in the normal 
manner, while layer units that provide that functionality
will explicitly specify that they do. For example, layer 
units that provide HTML documentation depend on the 
presence of a web server to enable a user to view them, 
but which one is not important. Instead of depending 
on a particular web server, they depend on the virtual 
layer name httpd. Similarly, layer units containing a 
web server and obeying system policy for the location of 
static HTML content, such as Apache or Boa, are defined
to provide the httpd virtual layer name and therefore 
satisfy those dependencies. Unlike regular layer units, 
virtual layers are not versioned. 


Example: Figure 2 shows how dependencies can af- 
fect a VLFS in practice. This VLFS has only one ex- 
plicit layer, mysql-server, but 21 implicitly selected lay- 
ers. The mysql-server layer itself has a number of di- 
rect dependencies, including Perl, TCP Wrappers, and 
the mailx program. These dependencies in turn de- 
pend on the Berkeley DB library and the GNU dbm li- 
brary, among others. Using its dependency mechanism, 
Strata is able to automatically resolve all the other lay- 
ers needed to create a complete file system by specifying 
just a single layer.


Returning to Figure 4, this example defines a subset 
of the layers that the mysql-server layer requires to be 
composed into the same VLFS to allow MySQL to run 
correctly. More generally, Figure 5 shows the complete 
syntax for the dependency metadata. Provides is the sim- 
plest, with only a comma separated list of virtual layer 
names. Conflicts adds an optional version constraint to 
each conflicted layer to limit the layer units that are actu- 
ally in conflict. Depends and Pre-Depends add a boolean 
OR of multiple layers in their dependency constraints to 
allow multiple layers to satisfy the dependency. 
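
The structure of a Depends or Pre-Depends value, a comma-separated list of clauses in which each clause is an OR of alternatives, can be made concrete with a small parser. The sketch is an illustration under assumptions: the function name parse_depends is hypothetical, and the set of comparison operators (<<, <=, =, >=, >>) follows Debian's control-file conventions, which is an assumption about what Strata accepts.

```python
import re

# One term: a layer name, optionally followed by "(operator version)".
_TERM = re.compile(
    r"^\s*([A-Za-z0-9][A-Za-z0-9+.\-]*)\s*"
    r"(?:\(\s*(<<|<=|=|>=|>>)\s*([^)\s]+)\s*\))?\s*$"
)


def parse_depends(value):
    """Parse a Depends-style value into a list of OR-groups.

    Each group is a list of (layer, operator, version) alternatives;
    operator and version are None when the term is unversioned.
    """
    groups = []
    for clause in value.split(","):
        if not clause.strip():
            continue
        alternatives = []
        for term in clause.split("|"):
            match = _TERM.match(term)
            if match is None:
                raise ValueError(f"cannot parse dependency term: {term!r}")
            alternatives.append(match.groups())
        groups.append(alternatives)
    return groups
```

Every group must be satisfied, and a group is satisfied when any one of its alternatives is present in the VLFS.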


Resolving Dependencies: To allow an administra- 
tor to select only the layers explicitly desired within the 
VLFS, Strata automatically resolves dependencies to de-
termine which other layers must be included implicitly.


Linux distributions already face this problem and tools 
have been developed to address it, such as Apt [2] and 
Smart [10]. To leverage Smart, Strata adopts the same 
metadata database format that Debian uses for packages 
for its own layers. In Strata, when an administrator 
requests that a layer be added to or removed from a tem-
plate appliance, Smart evaluates whether the operation can
succeed and determines the best set of layers to add or re-
move. Instead of acting directly on the contents of the
file system, however, Strata only has to update the tem- 
plate’s VLFS’s definition file with the set of layers to be 
composed into the file system. 
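
To make the idea of implicit selection concrete, the following sketch computes the set of layers implied by a list of explicit ones. It is deliberately naive and is not Smart's algorithm: versions are ignored, the first available alternative is taken, and conflicts are only checked at the end. The function name resolve and the catalogue layout are hypothetical.

```python
def resolve(explicit, catalogue):
    """Return the layers to compose, given the explicitly selected ones.

    `catalogue` maps a layer name to a dict with optional 'depends' (a list
    of OR-groups of names), 'provides' (virtual names), and 'conflicts'.
    """
    providers = {}
    for name, meta in catalogue.items():
        for virtual in meta.get("provides", []):
            providers.setdefault(virtual, []).append(name)

    selected = []
    pending = list(explicit)
    while pending:
        name = pending.pop(0)
        if name in selected:
            continue
        if name not in catalogue:
            raise LookupError(f"unknown layer: {name}")
        selected.append(name)
        for group in catalogue[name].get("depends", []):
            # Expand virtual names into the concrete layers providing them.
            candidates = []
            for alt in group:
                candidates.extend([alt] if alt in catalogue else providers.get(alt, []))
            if not candidates:
                raise LookupError(f"{name}: nothing satisfies {' | '.join(group)}")
            # Prefer an alternative that is already selected or queued.
            chosen = next(
                (c for c in candidates if c in selected or c in pending),
                candidates[0],
            )
            if chosen not in selected and chosen not in pending:
                pending.append(chosen)

    for name in selected:
        for other in catalogue[name].get("conflicts", []):
            if other in selected:
                raise ValueError(f"{name} conflicts with {other}")
    return selected
```

The result is only a list of layer names, which matches the observation above that resolution ends in an updated definition file rather than in changes to the file system.
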


5.3 Layer Creation


Strata allows layers to be created in two ways. First, 
Strata allows the .deb packages used by Debian-derived
distributions and the .rpm packages used by Red Hat-
derived distributions to be converted into layers that
Strata users can use. Strata converts packages into lay- 
ers in two steps. First, Strata extracts the relevant meta- 
data from the package, including its name and version. 
Second, Strata extracts the package’s file contents into a 
private directory that will be the layer’s file system com- 
ponents. When using converted packages, Strata lever- 
ages the underlying distribution’s tools to run the con- 
figuration scripts belonging to the newly created layers 
correctly. Instead of using the distribution’s tools to un- 
pack the software package, Strata composes the layers 
together and uses the distribution’s tools as though the 
packages have already been unpacked. Although Strata 
is able to convert packages from different Linux distri- 
butions, it cannot mix and match them because they are 
generally ABI incompatible with one another. 


More commonly, Strata leverages existing packaging 
methodologies to simplify the creation of layers from 
scratch. In a traditional system, when administrators in- 
stall a set of files, they copy the files into the correct 
places in the file system using the root of the file sys- 
tem tree as their starting point. For instance, an admin- 
istrator might run make install to install a piece of 
software compiled on the local machine. In Strata, however,
layer creation is a three-step process. First, instead of
copying the files into the root of the local file system, 
the layer creator installs the files into their own specific 
directory tree. That is, they make a blank directory to 
hold a new file system tree, which is created by having
make install copy the files into a tree rooted at that
directory, instead of the actual file system root. 


Second, the layer maintainer extracts programs that in- 
tegrate the files into the underlying file system and cre- 
ates scripts that run when the layer is added to and re- 
moved from the file system. Examples of this include 
integration with Gnome’s GConf configuration system, 
creation of encryption keys, or creation of new local 
users and groups for new services that are added. This 
leverages skills that package maintainers in a traditional 
package management world already have. 


Finally, the layer maintainer needs to set up the meta- 
data correctly. Some elements of the metadata, such as 
the name of the layer and its version, are simple to set, 
but dependency information can be much harder. But 
because package management tools have already had to 
address this issue, Strata is able to leverage the tools they 
have built. For example, package management systems 
have created tools that infer dependencies using an exe- 
cutable dynamically linking against shared libraries [15]. 


Instead of requiring the layer maintainer to enumerate 
each shared library dependency, we can programmati- 
cally determine which shared libraries are required and 
populate the dependency fields based on those versions 
of the library currently installed on the system where the 
layer is being created. 
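
One way to determine the required shared libraries programmatically is to read the NEEDED entries that the linker records in each ELF binary. The sketch below parses the output of objdump -p for that purpose. It is an illustration, not the tool cited above: the function names and the soname_to_layer mapping are hypothetical, and version constraints are not derived.

```python
import subprocess


def needed_libraries(objdump_output):
    """Extract sonames from the NEEDED entries of `objdump -p` output."""
    sonames = []
    for line in objdump_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "NEEDED":
            sonames.append(parts[1])
    return sonames


def infer_depends(binary_path, soname_to_layer):
    """Map a binary's NEEDED sonames to layer names.

    `soname_to_layer` is a hypothetical lookup table, for example
    {"libwrap.so.0": "tcpd"}, built from the layers installed on the
    system where the new layer is being created.
    """
    output = subprocess.run(
        ["objdump", "-p", binary_path],
        check=True, capture_output=True, text=True,
    ).stdout
    layers = []
    for soname in needed_libraries(output):
        layer = soname_to_layer.get(soname)
        if layer is not None and layer not in layers:
            layers.append(layer)
    return layers
```

The returned names could then be written into the Depends field of the new layer's metadata.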


5.4 Layer Repositories 


Strata provides local and remote layer repositories. Local 
layer repositories are provided by locally accessible file 
system shares made available by a SAN. They contain 
layer units to be composed into the VLFS. This is sim- 
ilar to a regular virtualization infrastructure in which all 
the virtual machines’ disks are stored on a shared SAN. 
Each layer unit is stored as its own directory; a local layer 
repository contains a set of directories, each of which 
corresponds to a layer unit. The local layer repository’s 
contents are enumerated in a database file providing a 
flat representation of the metadata of all the layer units 
present in the repository. The database file is used for 
making a list of what layers can be installed and their de- 
pendency information. By storing the shared layer repos- 
itory on the SAN, Strata lets layers be shared securely 
among different users’ appliances. Even if the machine 
hosting the VLFS is compromised, the read-only layers 
will stay secure, as the SAN will enforce the read-only 
semantic independently of the VLFS. 
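
The flat database can be pictured as the concatenation of every layer unit's metadata. The sketch below builds such a text from a repository directory. It assumes that each layer unit directory holds a file named metadata, as described in Section 5.1; the function name and the blank-line separator between records are illustrative assumptions.

```python
import os


def build_repository_database(repo_root):
    """Concatenate each layer unit's metadata file into one flat text."""
    stanzas = []
    for entry in sorted(os.listdir(repo_root)):
        metadata_path = os.path.join(repo_root, entry, "metadata")
        if os.path.isfile(metadata_path):
            with open(metadata_path) as handle:
                stanzas.append(handle.read().strip())
    # Records are separated by one blank line (an assumed convention).
    return "\n\n".join(stanzas) + "\n"
```

A client can read this single file to list installable layers and their dependencies without opening each layer unit directory.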

Remote layer repositories are similar to local layer 
repositories, but are not accessible as file system shares. 
Instead, they are provided over the Internet, by protocols 
such as FTP and HTTP, and can be mirrored into a local 
layer repository. Instead of mirroring the entire remote 
repository, Strata allows on-demand mirroring, where all 
the layers provided by the remote repository are acces-
sible to the VAs, but must be mirrored to the local mir-
ror before they can be composed into a VLFS. This al-
lows administrators to store only the needed layers while
maintaining access to all the layers and updates that the 
repository provides. Administrators can also filter which 
layers should be available to prevent end users from us- 
ing layers that violate administration policy. In general, 
an administrator will use these remote layer repositories 
to provide the majority of layers, much as administrators 
use a publicly managed package repository from a regu- 
lar Linux distribution. 

Layer repositories let Strata operate within an enter- 
prise environment by handling three distinct yet related 
issues. First, Strata has to ensure that not all end users 
have access to every layer available within the enterprise. 
For instance, administrators may want to restrict certain 
layers to certain end users for licensing or security rea- 
sons. Second, as enterprises get larger, they gain levels 
of administration. Strata must support the creation of an 
enterprise-wide policy while also enabling small groups
within the enterprise to provide more localized admin- 
istration. Third, larger enterprises supporting multiple 
operating systems cannot rely on a single repository of 
layers because of inherent incompatibilities among oper- 
ating systems. 

By allowing a VLFS to use multiple repositories, 
Strata solves these three problems. First, multiple reposi- 
tories let administrators compartmentalize layers accord- 
ing to the needs of their end users. By providing end 
users with access only to needed repositories, organiza- 
tions prevent their end users from using the other layers. 
Strata depends on traditional file system access control 
mechanisms to enforce these permissions. Second, by al- 
lowing sub-organizations to set up their own repositories, 
Strata lets a sub-organization’s administrator provide the 
layers that end users need without requiring intervention 
by administrators of global repositories. Finally, multi- 
ple repositories allow Strata to support multiple operat- 
ing systems, as each distinct operating system has its own 
set of layer repositories. 


5.5 VLFS Composition 


To create a VLFS, Strata has to solve a number of file 
system-related problems. First, Strata has to support the 
ability to combine numerous distinct file system layers 
into a single static view. This is equivalent to installing 
software into a shared read-only file system. Second, be- 
cause users expect to treat the VLFS as a normal file sys- 
tem, for instance, by creating and modifying files, Strata 
has to let VLFSs be fully modifiable. By the same token, 
users must also be able to delete files that exist on the 
read-only layer. 

By basing the VLFS on top of unioning file sys- 
tems [11, 19], Strata solves all these problems. Unioning 
file systems join multiple layers into a single namespace. 
Unioning file systems have been extended to apply at- 
tributes such as read-only and read-write to their layers. 
The VLFS leverages this property to force shared lay-
ers to be read-only, while the private layer remains read-
write. If a file from a shared read-only layer is mod- 
ified, it is copied-on-write (COW) to the private read- 
write layer before it is modified. For example, Live-CDs 
use this functionality to provide a modifiable file system 
on top of the read-only file system provided by the CD. 
Finally, unioning file systems use white-outs to obscure 
files located on lower layers. For example, if a file lo- 
cated on a read-only layer is deleted, a white-out file will 
be created on the private read-write layer. This file is in-
terpreted specially by the file system and is not revealed
to the user while also preventing the user from seeing 
files with the same name. 
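
The interplay of read-only layers, copy-on-write, and white-outs can be shown with a toy in-memory model in which each layer is a dictionary from path to contents. This is a teaching sketch, not a file system: the class name UnionView and the WHITEOUT marker are hypothetical, and directories, permissions, and partial writes are ignored.

```python
WHITEOUT = object()  # marker stored in the private layer to hide a lower file


class UnionView:
    """Toy model: read-only shared layers (dicts) under one private dict."""

    def __init__(self, shared_layers):
        self.shared = list(shared_layers)  # searched in order, first match wins
        self.private = {}                  # the only layer that is ever written

    def read(self, path):
        if path in self.private:
            if self.private[path] is WHITEOUT:
                raise FileNotFoundError(path)
            return self.private[path]
        for layer in self.shared:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Copy-on-write reduces here to storing the new contents privately;
        # the shared layers are never modified.
        self.private[path] = data

    def unlink(self, path):
        self.read(path)  # raises FileNotFoundError if the path is not visible
        if any(path in layer for layer in self.shared):
            # The file still exists below, so hide it with a white-out.
            self.private[path] = WHITEOUT
        else:
            del self.private[path]
```

After unlink, the shared dictionary still contains the file, which is why other VLFSs composed from the same layer are unaffected.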

But end users need to be able to recover deleted files 
by reinstalling or upgrading the layer containing them.
This is equivalent to deleting a file from a traditional 
monolithic file system, but reinstalling the package con- 
taining the file in order to recover it. Also, Strata sup- 
ports adding and removing layers dynamically without 
taking the file system off line. This is equivalent to 
installing, removing, or upgrading a software package 
while a monolithic file system is online.


Unlike a traditional file system, where deleted system 
files can be recovered simply by reinstalling the package 
containing that file, in Strata, white-outs in the private 
layer persist and continue to obscure the file even if the 
layer is replaced. To solve this problem, Strata provides 
a VLFS with additional writeable layers associated with
each read-only shared layer. Instead of containing file 
data, as does the topmost private writeable layer, these 
layers just contain white-out marks that will obscure files 
contained within their associated read-only layer. The 
user can delete a file located in a shared read-only layer, 
but the deletion only persists for the lifetime of that par- 
ticular instance of the layer. When a layer is replaced 
during an upgrade or reinstall, a new empty white-out 
layer will be associated with the replacement, thereby 
removing any preexisting white-outs. In a similar way, 
Strata handles the case where a file belonging to a shared
read-only layer is modified and therefore copied to the 
VLFS’s private read-write layer. Strata provides a revert 
command that lets the owner of a file that has been mod- 
ified revert the file to its original pristine state. While a 
regular VLFS unlink operation would have removed the 
modified file from the private layer and created a white- 
out mark to obscure the original file, revert only removes 
the copy in the private layer, thereby revealing the origi- 
nal below it. 
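
The difference between unlink, revert, and layer replacement can be sketched with a toy model in which every shared layer carries its own white-out set. The class and method names below are hypothetical, and the model stores file contents as plain strings. The point is only that white-outs live with one instance of a layer and disappear when that instance is replaced.

```python
class LayerInstance:
    """One VLFS's view of a shared layer: read-only files plus the
    white-out set that this VLFS associates with the layer."""

    def __init__(self, files):
        self.files = dict(files)
        self.whiteouts = set()  # starts empty for every new instance


class Vlfs:
    def __init__(self, layers):
        self.layers = dict(layers)  # layer name -> LayerInstance
        self.private = {}           # private read-write layer

    def read(self, path):
        if path in self.private:
            return self.private[path]
        for layer in self.layers.values():
            if path in layer.files and path not in layer.whiteouts:
                return layer.files[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        self.private[path] = data

    def unlink(self, path):
        self.read(path)  # raises FileNotFoundError if the path is not visible
        self.private.pop(path, None)
        for layer in self.layers.values():
            if path in layer.files:
                layer.whiteouts.add(path)

    def revert(self, path):
        # Drop only the private copy, revealing the pristine shared file.
        self.private.pop(path, None)

    def replace_layer(self, name, new_files):
        # The replacement arrives with a fresh, empty white-out set.
        self.layers[name] = LayerInstance(new_files)
```

Calling replace_layer after unlink makes the file visible again, whereas a single white-out store in the private layer would keep hiding it.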

Strata also allows a VLFS to be managed while be- 
ing used. Some upgrades, specifically of the kernel, will 
require the VA to be rebooted, but most should be able 
to occur without taking the VA off line. However, if a 
layer is removed from a union, the data is effectively re- 
moved as well because unions operate only on file system 
namespaces and not on the data the underlying files con- 
tain. If an administrator wants to remove a layer from 
the VLFS, they must take the VA off line, because layers 
cannot be removed while in use. 


To solve this problem, Strata emulates a traditional 
monolithic file system. When an administrator deletes 
a package containing files in use, the processes that are 
currently using those files will continue to work. This 
occurs by virtue of unlink’s semantic of first removing 
a file from the file system’s namespace, and only remov- 
ing its data after the file is no longer in use. This lets 
processes continue to run because the files they need will 
not be removed until after the process terminates. This 
creates a semantic in which a currently running program 
can be using versions of files no longer available to other 
programs. 

Existing package managers use this semantic to allow 
a system to be upgraded online, and it is widely under- 
stood. Strata applies the same semantic to layers. When 
a layer is removed from a VLFS, Strata marks the layer 
as unlinked, removing it from the file system names- 
pace. Although this layer is no longer part of the file 
system namespace and thus cannot be used by any oper- 
ations such as open that work on the namespace, it does 
remain part of the VLFS, enabling data operations such 
as read and write to continue working correctly for 
previously opened files. 
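
The unlink semantic that Strata extends to layers can be observed on any POSIX system. The short Python sketch below is only a demonstration of that standard file behavior, not of Strata itself; the function name read_after_unlink is made up for the example, and it assumes a POSIX platform where an open file may be unlinked.

```python
import os
import tempfile


def read_after_unlink(payload: bytes) -> bytes:
    """Write payload to a temp file, unlink it, then read it back."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.unlink(path)                    # name leaves the namespace
        assert not os.path.exists(path)    # namespace lookups now fail
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, len(payload))   # data operations still work
    finally:
        os.close(fd)                       # data is released only now
```

The open descriptor plays the role of a previously opened file in an unlinked layer: it cannot be found by name any more, yet reads through it continue to succeed until it is closed.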


6 Experimental Results 


We have implemented Strata’s VLFS as a loadable kernel 
module on an unmodified Linux 2.6 series kernel as well 
as a set of userspace management tools. The file system 
is a stackable file system and is an extended version of 
UnionFS [19]. We present experimental results using our 
Strata Linux prototype to manage various VAs, demon- 
strating its ability to reduce management costs while 
incurring only modest performance overhead. Experi- 
ments were conducted on VMware ESX 3.0 running on 
an IBM BladeCenter with 14 IBM HS20 eServer blades 
with dual 3.06 GHz Intel Xeon CPUs, 2.5 GB RAM, 
and a Q-Logic Fibre Channel 2312 host bus adapter con- 
nected to an IBM ESS Shark SAN with 1 TB of disk 
space. The blades were connected by a gigabit Ether- 
net switch. This is a typical virtualization infrastructure 
in an enterprise computing environment where all vir- 
tual machines are centrally stored and run. We compare 
plain Linux VMs with a virtual block device stored on 
the SAN and formatted with the ext3 file system to VMs 
managed by Strata with the layer repository also stored 
on the SAN. By storing both the plain VM’s virtual block 
device and Strata’s layers on the SAN, we eliminate any 
differences in performance due to hardware architecture. 

To measure management costs, we quantify the time 
taken by two common tasks, provisioning and updating 
VAs. We quantify the storage and time costs for pro- 
visioning many VAs and the performance overhead for 
running various benchmarks using the VAs. We ran ex- 
periments on five VAs: an Apache web server, a MySQL 
SQL server, a Samba file server, an SSH server provid- 
ing remote access, and a remote desktop server provid- 
ing a complete GNOME desktop environment. While the 
server VAs had relatively few layers, the desktop VA had 
very many layers. This enables the experiments to show 
how the VLFS performance scales as the number of lay- 
ers increases. To provide a basis for comparison, we pro- 
visioned these VAs using (1) the normal VMware virtu- 
alization infrastructure and plain Debian package man- 
agement tools, and (2) Strata. To make a conservative 
comparison to plain VAs and to test larger numbers of 
plain VAs in parallel, we minimized the disk usage of 
the VAs. The desktop VA used a 2 GB virtual disk, while 
all others used a 1 GB virtual disk. 


         Apache   MySQL    Samba    SSH      Desktop 
Plain                                        355s 
Strata   0.002s   0.002s   0.002s   0.002s   0.002s 
QCOW2    0.003s   0.003s   0.003s   0.003s   0.003s 

Table 1: VA Provisioning Times 


6.1 Reducing Provisioning Times 


Table 1 shows how long it takes Strata to provision VAs 
versus regular and COW copying. To provision a VA us- 
ing Strata, Strata copies a default VMware VM with an 
empty sparse virtual disk and provides it with a unique 
MAC address. It then creates a symbolic link on the 
shared file system from a file named by the MAC address 
to the layer definition file that defines the configuration 
of the VA. When the VA boots, it accesses the file de- 
noted by its MAC address, mounts the VLFS with the 
appropriate layers, and continues execution from within 
it. To provision a plain VA using regular methods, we 
use QEMU’s qemu-img tool to create both raw copies 
and COW copies in the QCOW2 disk image format. 

Our measurements for all five VAs show that using 
COW copies and Strata takes about the same amount of 
time to provision VAs, while creating a raw image takes 
much longer. Creating a raw image for a VA takes 3 to 
almost 6 minutes and is dominated by the cost of copy- 
ing data to create a new instance of the VA. For larger 
VAs, these provisioning times would only get worse. In 
contrast, Strata provisions VAs in only a few millisec- 
onds because a null VMware VM has essentially no data 
to copy. Layers do not need to be copied, so copying 
overhead is essentially zero. While COW images can 
be created in a similar amount of time, they do not pro- 
vide any of the management benefits of Strata, as each 
new COW image is independent of the base image from 
which it was created. 
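
The MAC-address indirection used for provisioning can be sketched as follows. This is a hedged illustration rather than Strata's actual tooling: the file naming scheme (mac_to_filename), the one-layer-per-line format of the layer definition file, and all function names are assumptions made for the example.

```python
import os


def mac_to_filename(mac):
    # Assumed naming scheme: lower-case MAC address with ':' replaced by '-'.
    return mac.lower().replace(":", "-")


def provision(shared_dir, mac, layer_definition_path):
    """Link a file named after the VM's MAC address to its layer definition."""
    link = os.path.join(shared_dir, mac_to_filename(mac))
    os.symlink(layer_definition_path, link)   # no layer data is copied
    return link


def layers_at_boot(shared_dir, mac):
    """What a booting VA would do: follow the link and list its layers."""
    link = os.path.join(shared_dir, mac_to_filename(mac))
    with open(link) as definition:
        # Assumed format: one layer name per non-empty line.
        return [line.strip() for line in definition if line.strip()]
```

Because provisioning only creates a symbolic link, its cost does not depend on the size of the layers, which is consistent with the millisecond-scale times reported in Table 1.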


6.2 Reducing Update Times 


Table 2 shows how long it takes to update VAs us- 
ing Strata versus traditional package management. We 
provisioned ten VA instances each of Apache, MySQL, 
Samba, SSH, and Desktop for a total of 50 provisioned 
VAs. All were kept in a suspended state. When a se- 
curity patch was made available for the tar package 
installed in all the VAs, we updated them [18]. Strata 
simply updates the layer definition files of the VM tem- 
plates, which it can do even when the VAs are not active. 
When the VA is later resumed during normal operation, 
it automatically checks to see if the layer definition file 
has been updated and updates the VLFS namespace view 
accordingly, an operation that is measured in microsec- 
onds. To update a plain VA using normal package man- 
agement tools, each VA instance needs to be resumed and 
put on the network. An administrator or script must ssh 
into each VA, fetch and install the update packages from 
a local Debian mirror, and finally re-suspend the VA. 


Figure 6: Storage Requirements 


Table 2 shows the total average time to update each 
VA using traditional methods versus Strata. We break 
down the update time into times to resume the VM, get 
access to the network, actually perform the update, and 
re-suspend the VA. The measurements show that the cost 
of performing an update is dominated by the manage- 
ment overhead of preparing the VAs to be updated and 
not the update itself. Preparation is itself dominated by 
getting an IP address and becoming accessible on a busy 
network. While this cost is not excessive on a quiet net- 
work, on a busy network it can take a significant amount 
of time for the client to get a DHCP address, and for the 
ARP on the machine controlling the update to find the 
target machine. The average total time to update each 
plain VA is about 73 seconds. In contrast, Strata takes 
only a second to update each VA. As this is an order 
of magnitude shorter even than resuming the VA, Strata 
is able to delay the update to a point when the VA will 
be resumed from standby normally without impacting its 
ability to quickly respond. Strata provides over 70 times 
faster update times than traditional package management 
when managing even a modest number of VAs. Strata’s 
ability to decrease update times would only improve as 
the number of VAs being managed grows. 





Table 2: VA Update Times 


6.3 Reducing Storage Costs 


Figure 6 shows the total storage space required for dif- 
ferent numbers of VAs stored with raw and COW disk 
images versus Strata. We show the total storage space 
for 1 Apache VA, 5 VAs corresponding to an Apache, 
MySQL, Samba, SSH, and Desktop VA, and 50 VAs cor- 
responding to 10 instances of each of the 5 VAs. As ex- 
pected, for raw images, the total storage space required 
grows linearly with the number of VA instances. In con- 
trast, the total storage space using COW disk images and 
Strata is relatively constant and independent of the num- 
ber of VA instances. For one VA, the storage space re- 
quired for the disk image is less than the storage space 
required for Strata, as the layer repository used contains 
more layers than those used by any one of the VAs. In 
fact, to run a single VA, the layer repository size could 
be trimmed down to the same size as the traditional VA. 

For larger numbers of VAs, however, Strata provides 
a substantial reduction in the storage space required, be- 
cause many VAs share layers and do not require dupli- 
cate storage. For 50 VAs, Strata reduces the storage 
space required by an order of magnitude over the raw 
disk images. Table 3 shows that there is much dupli- 
cation among statically provisioned virtual machines, as 
the layer repository of 405 distinct layers needed to build 
the different VLFSs for multiple services is basically the 
same size as the largest service. Although Strata initially 
has no significant storage benefit over COW disk images, 
each COW disk image is independent of the version it 
was created from and must therefore be managed 
independently. This increases storage usage, as 
the same updates must be independently applied to many 
independent disk images. 


6.4 Virtualization Overhead 


To measure the virtualization cost of Strata’s VLFS, 
we used a range of micro-benchmarks and real appli- 
cation workloads to measure the performance of our 
Linux Strata prototype, then compared the results against 
vanilla Linux systems within a virtual machine. The vir- 
tual machine’s local file system was formatted with the 
Ext3 file system and given read-only access to a SAN 
partition formatted with Ext3 as well. We performed 
each benchmark in each scenario 5 times and provide the 
average of the results. 







Repo     Apache   MySQL    Samba    SSH      Desktop 
1.8GB    217MB    206MB    169MB    127MB    1.7GB 


To demonstrate the effect that Strata’s VLFS has on 
system performance, we performed a number of bench- 
marks. Postmark [7], the first benchmark, is a synthetic 
test that measures how the system would behave if used 
as a mail server. Our postmark test operated on files be- 
tween 512 and 10K bytes, with an initial set of 20,000 
files, and performed 200,000 transactions. Postmark is 
very intensive on a few specific file system operations 
such as lookup(), create(), and unlink(), be- 
cause it is constantly creating, opening, and removing 
files. Figure 7 shows that running this benchmark within 
a traditional VA is significantly faster than running it in 
Strata. This is because as Strata composes multiple file 
system namespaces together, it places significant over- 
head on those namespace operations. 

To demonstrate that postmark’s results are not indica- 
tive of application oriented performance, we ran two 
application benchmarks to measure the overhead Strata 
imposes in a desktop and server VA scenario. The 
first benchmark was a multi-threaded build of the Linux 
2.6.18.6 kernel with two concurrent jobs using the two 
CPUs allocated to the VM. In all scenarios, we added the 
8 software layers required to build a kernel to the layers 
needed to provide the service. Figure 7 shows that while 
Strata imposes a slight overhead on the kernel build com- 
pared to the underlying file system it uses, the cost is 
minimal, under 5% at worst. 

The second benchmark measured the number of HTTP 
transactions completed per second by an Apache web 
server placed under load. We imported 
the database of a popular guitar tab search engine and 
used the http_load [13] benchmark to continuously 
perform a set of 20 search queries on the database 
until 60,000 queries in total had been performed. For 
each case that did not already contain Apache, we added 
the appropriate layers to the layer definition file to make 
Apache available. Figure 7 shows that Strata imposes a 
minimal overhead of only 5%. 


While the Postmark benchmark demonstrated that the 
VLFS is not an appropriate file system for workloads that 
are heavy with namespace operations, this shouldn’t pre- 
vent Strata from being used in those scenarios. No file 
system is appropriate for all workloads and no system 
has to be restricted to simply using one file system. One 
can use Strata and the VLFS to manage the system’s con- 
figuration while also providing an additional traditional 
file system on a separate partition or virtual disk drive 
to avoid all the overhead the VLFS imposes. This will 
be very effective for workloads, such as the mail server 
Postmark is emulating, where namespace heavy opera- 
tions, such as a mail server processing its mail queue, 
can be kept on a dedicated file system. 


7 Conclusions and Future Work 


Strata introduces a new and better way for system admin- 
istrators to manage virtual appliances using virtual lay- 
ered file systems. Strata integrates package management 
semantics with the file system by using a novel form of 
file system unioning to enable dynamic composition of file 
system layers. This provides powerful new management 
functionality for provisioning, upgrading, securing, and 
composing VAs. VAs can be quickly and simply provi- 
sioned as no data needs to be copied into place. VAs can 
be easily upgraded as upgrades can be done once cen- 
trally and applied atomically, even for a heterogeneous 
mix of VAs and when VAs are suspended or turned off. 
VAs can be more effectively secured since file system 
modifications are isolated so compromises can be eas- 
ily identified. VAs can be composed as building blocks 
to create new systems since file system composition also 
serves as the core mechanism for creating and maintain- 
ing VAs. We have implemented Strata on Linux by pro- 
viding the VLFS as a loadable kernel module, with- 
out requiring any source code level kernel changes, and 
have demonstrated how Strata can be used in real life 
situations to improve the ability of system administra- 
tors to manage systems. Strata significantly reduces the 
amount of disk space required for multiple VAs, and al- 
lows them to be provisioned almost instantaneously and 
quickly updated no matter how many are in use. 

While Strata exists only as a lab prototype today, there 
are a few steps that could make it significantly more de- 
ployable. First, our changes to UnionFS should either be 
integrated with the current version of UnionFS or with 
another unioning file system. Second, better tools should 
be created for managing the creation and management of 
individual layers. This can include better tools for con- 
verting layers from existing Linux distributions as well 
as new tools that enable layers to be created in a way 
that takes full advantage of Strata’s concepts. Third, the 
ability to integrate Strata’s concepts with cloud com- 
puting infrastructures, such as Eucalyptus, should be in- 
vestigated. 


Abstract 


Starting/stopping a whole cluster or a part of it is 
a real challenge considering the different commands 
related to various device types and manufacturers, 
and the order that should be respected. This arti- 
cle presents a solution called the sequencer that al- 
lows the automatic shutting down and starting up 
of clusters, subsets of clusters or even data-centers. 
It provides two operation modes designed for ease 
of use and emergency conditions. Our product has 
been designed to be efficient and it is currently used 
to power on and power off one of the largest clusters 
in the world: the Tera-100, made of more than 4000 
nodes. 

Keywords: emergency power off, start/stop 
procedure, actions sequencing, workflow, 
planning, cluster management, automation, 
case study. 


1 Introduction 


Emergency Power Off (EPO) is often not consid- 
ered from a system administration point of view in 
traditional clusters¹. Even for Tier-IV infrastruc- 
tures [1], EPO may happen for various reasons. In 
such cases, stopping (or starting, both cases are ad- 
dressed) macro components such as a whole rack or a 
rack set requires an appropriate sequence of actions. 
Considering the vast number of different components 
a cluster is composed of: 


nodes: compute nodes², login nodes, management 
nodes, io (nfs, lustre, ...) nodes, ... 


hardware: power switches (also called Power Dis- 
tribution Units or PDUs), ethernet switches, in- 
finiband [2] switches, cold doors³, disk arrays, 


powering on/off a whole set of racks can be a real 
challenge. 

¹ In this article, we consider clusters because they are the 
first target of our solution. However, this solution also applies 
to general data-centers. 

² From a hardware perspective, a node in a cluster is just a 
computer. A distinction is made however between nodes de- 
pending on their roles in the cluster. For example, users might 
connect to a login node for development and job submission. 
The batch scheduler runs on the management node and dis- 
patches jobs to compute nodes. Compute nodes access storage 
through io nodes and so on. 

³ A cold door is a water-based cooling system produced by 
Bull that allows high density servers in the order of 40 kW per 
rack. 

First, since it is made of a set of heterogeneous 
devices, starting/stopping each component of a clus- 
ter is not straightforward: usually each device type 
comes with its own power on/off command. For ex- 
ample, shutting down a node can be as simple as an 
'ssh host /sbin/halt -p'. However, it might be 
preferable to use an out of band command through 
the IPMI BMC⁴ if the node is unresponsive, for ex- 
ample. Starting a node can be done using a wake on 
lan [3] command or an IPMI [4] command. Some de- 
vices cannot be powered on/off remotely (infiniband 
or ethernet switches for example). Those devices 
might be connected to manageable Power Distribu- 
tion Units (PDUs) that can remotely switch their 
outlets on/off using SNMP [5] commands. In the 
extreme case, manual intervention might be required 
to switch on/off the electrical power. 

For software components, there is also a need 
to manage the shutdown of multiple components 
on different nodes. Considering high availability 
frameworks, virtualization, localization, and clients, 
using standard calls to /etc/init.d/service 
[start|stop] is often inappropriate. 

Finally, the set of instructions for the powering 
on/off of each cluster’s components should be or- 
dered. Trivial examples include: 


• powering off an ethernet switch too soon may 
prevent other components, including nodes, from 
being powered off; 


• powering off a cold door should be done at the 
very end to prevent cooled components from be- 
ing burnt out. 


Note that this ordering problem is not only relevant 
to hardware devices. A software component can also 
require that a sequence of instructions be executed 
before it is stopped. As a trivial example, when 
an NFS daemon is stopped, one may prefer that all 
NFS clients unmount their related directories first in 
order to prevent either the filling of syslog with NFS 
mount errors (when the NFS mount option is 'soft') or 
a sharp increase in load average caused by software 
freezing while accessing the NFS directories (when the 
NFS mount option is 'hard'). 

Therefore, in this article, the generic term ’com- 
ponent’ may define a hardware component such as a 
node, a switch, or a cold door, or a software compo- 
nent such as a lustre server or an NFS server. 


⁴ Baseboard Management Controller. 

Our proposition, called the sequencer, ad- 
dresses the problem of starting/stopping a cluster (or 
a data-center). Its design takes into account emer- 
gency conditions. Those conditions impose various 
constraints addressed by our solution: 


• Predictive: an EPO should have been validated 
before being used. It should not perform un- 
known actions. 


• Easy: an EPO should be easy to launch. The 
emergency cause may happen at any time, espe- 
cially when skilled staff is not present. There- 
fore, the EPO procedure should be launchable 
by “unskilled” humans. 


• Fast: an EPO should be as fast as possible. 


• Smart: an EPO should power off each compo- 
nent of a cluster in the correct order so most 
resources will be preserved. 


• Robust: an EPO should be tolerant to failure. 
For example, if a shutdown on a node cooled by 
a cold door returned an error, the corresponding 
cold door should not be switched off to prevent 
the burnout of the node. On the other hand, the 
rest of the cluster can continue the EPO process. 


This article is organized as follows: section 2 presents 
the design of our solution, some implementation 
details are dealt with in section 3, followed by scala- 
bility issues in section 4. Some results of our initial 
implementation are given in section 5. Section 6 com- 
pares our solution to related works. Finally, section 7 
presents future works. 


2 Design 


Three major difficulties arise when considering the 
starting/stopping of a cluster or of a subset of it: 


1. the computing of the dependency graph between 
components (power off a cold door after all com- 
ponents of the related rack have been powered 
off); 


2. the defining of an efficient (scalable) instructions 
sequence where the order defined by the depen- 
dency graph is respected (powering off nodes 
might be done in parallel); 


3. the execution of the instructions set itself, taking 
failure into account properly (do not power off 
the cold door, if related rack’s nodes have failed 
to power off). 


Therefore, the sequencer is made of three distinct 
functional layers: 


Dependency Graph Maker (DGM): this layer 
computes the dependency graph according to de- 
pendency rules defined by the system adminis- 
trator in a database. 


Instructions Sequence Maker (ISM): this layer 
computes the instructions sequence that should 
be executed for starting/stopping the given list 
of components and that satisfies dependency 
constraints defined in the dependency graph 
computed by the previous layer. 


Instructions Sequence Executor (ISE): this 
layer executes the instructions sequence com- 
puted by the previous layer and manages the 
handling of failures. 


Finally, a chaining of those layers is supported 
through an “all-in-one” command. 
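
To make the division of labour between the last two layers more concrete, the following Python sketch shows one possible way to turn a dependency graph into parallel batches and to execute them while honouring failures. It is an illustration under stated assumptions, not the sequencer's implementation: the function names make_sequence and execute, the dictionary-based graph, and the boolean action callback are all hypothetical.

```python
def make_sequence(deps):
    """Turn a dependency graph into ordered batches.

    deps maps each component to the set of components that must be
    processed before it. Members of one batch have no ordering
    constraint between them and may be processed in parallel.
    """
    remaining = {c: set(d) for c, d in deps.items()}
    for prerequisites in deps.values():
        for c in prerequisites:
            remaining.setdefault(c, set())
    batches = []
    while remaining:
        ready = sorted(c for c, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle detected")
        batches.append(ready)
        for c in ready:
            del remaining[c]
        for d in remaining.values():
            d.difference_update(ready)
    return batches


def execute(batches, deps, action):
    """Run action(component) -> bool batch by batch.

    A component is skipped when one of its prerequisites failed or was
    itself skipped; unrelated components are still processed.
    """
    done, failed, skipped = [], [], []
    blocked = set()
    for batch in batches:
        for c in batch:
            if deps.get(c, set()) & blocked:
                skipped.append(c)
                blocked.add(c)
            elif action(c):
                done.append(c)
            else:
                failed.append(c)
                blocked.add(c)
    return done, failed, skipped
```

With deps = {"colddoor3": {"node100", "node101"}}, the nodes form the first batch and the cold door the second; if powering off node101 fails, colddoor3 is skipped rather than switched off, which is the robustness property required above.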

This design therefore provides two distinct pro- 
cesses for the starting/stopping of components: 


Incremental Mode: in this mode, each stage is run 
separately. The output of each stage can be ver- 
ified and modified before being passed to the 
next stage as shown in figure 2.1. The incre- 
mental mode generates a script from constraints 
expressed in a database table and from a com- 
ponents list. This script is optimized in terms 
of scalability and is interpretable by the instruc- 
tions sequence executor that deals with paral- 
lelism and failures. This mode is the one de- 
signed for emergency cases. The instructions set 
computed should be validated before being used 
in production. 








Figure 2.1: Incremental Mode: each stage output can 
be verified and modified before being passed to the 
next one. (Stages shown: Dependency Graph Maker, 
Sequence Maker, Instructions Sequence Executor.) 


Figure 2.2: Black Box Mode: using the sequencer for 
simple non-critical usage. 


Black Box Mode: in this mode, illustrated in fig- 
ure 2.2, the chaining feature is used to start/stop 
components as shown by the following syntax: 

# clmsequencer \                   # command name 
    stop \                         # ruleset name 
    colddoor3 node[100-200]        # components list 

This command is somewhat equivalent to the fol- 
lowing: 

# clmsequencer \                   # command name 
    depmake \                      # dgm stage 
    stop \                         # ruleset name 
    colddoor3 node[100-200] \      # components list 
    |clmsequencer seqmake \        # ism stage 
    |clmsequencer seqexec          # ise stage 

and can therefore be seen as syntactic sugar. 


The computation of the dependency graph and of 
the instruction set can take a significant amount of 
time, especially on very large clusters such as the 
Tera-100⁵. This is another good reason for choosing 
the incremental mode in emergency conditions where 
each minute is important. 


3 Implementation 


3.1 Dependency Graph Maker (DGM) 


The Dependency Graph Maker (DGM) is the first 
stage of the sequencer. It takes a components list 
as a parameter, and produces a dependency graph as 
output. It uses a set of dependency rules described 
in a database table. The CLI has the following usage: 

# clmsequencer depgraph [--out file] 
      ruleset cl_1...cl_N 

The output is a human readable description of the 
computed dependency graph in XML format that the 
Instructions Sequence Maker can parse. By default, 
the computed dependency graph is produced on the 
standard output. 

The --output file option allows the computed 
dependency graph to be written to the specified 
file. 

The ruleset parameter defines which ruleset 
should be used to compute the dependency graph. 
Rulesets are explained in section 3.1.2 on the se- 
quencer table. 

Finally, the other parameters cl_1...cl_N define on 
which components the dependency graph should be 
computed. Each parameter describes a list of com- 
ponents in a specific format described in sec- 
tion 3.1.1. 







3.1.1 Components list specification 


The first stage of the sequencer takes as an input a 
list of components. This list is of the form: 


prefix[a-b,c-d,...][#type][@category] 


where: 


prefix[a-b,c-d,...]: is the standard contracted nota- 
tion for designating a set of names prefixed by 
'prefix' and suffixed by a number taken in 
the range given by intervals [a-b], [c-d], and 
so on. For example, compute[1-3,5,7-8] de- 
fines names: compute1, compute2, compute3, 
compute5, compute7, compute8. 


category: is optional and defines the table⁶ in which 
the given names should be looked up to find their type 
(if not given). The type of a component is 
used in the definition of the dependency table 
as described in section 3.1.2. Category exam- 
ples (with some related types) are: node (io, 
nfs, login, compute), hwmanager (bmc, cmc, 
coldoor⁷) and soft (nfsd, nagios, sshd). 


⁵ Tera-100 is ranked #6 in the Top500 november 2010 list 
of fastest supercomputers in the world and #1 in Europe. It 
is composed of several thousands of Bull bullx series S servers. 
See http://www.top500.org/ for details. 


Some examples of full component list names are given 
below: 


R-[1-3]#io@rack: the io racks R-1, R-2 and R-3; 


bullx[10-11]#mds@node: the lustre mds nodes 
bullx10 and bullx11; 


colddoor1#coldoor@hwmanager: the cold door 
numbered 1; 


esw-1#eth@switch: the ethernet switch esw-1; 


server[1-2]#nfsd@soft: NFS daemons running on 
server1 and server2. 
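
The contracted notation can be expanded mechanically. The Python sketch below is one possible reading of the format described above, given for illustration only; the function name expand and the regular expression are assumptions, and the real sequencer may accept a richer syntax.

```python
import re

# prefix[a-b,c-d,...][#type][@category]; every bracketed part is optional.
_SPEC = re.compile(
    r"^(?P<prefix>[^\[\]#@]*)"
    r"(?:\[(?P<ranges>[^\]]*)\])?"
    r"(?:#(?P<type>[^@]+))?"
    r"(?:@(?P<category>.+))?$"
)


def expand(spec):
    """Return (names, type, category) for one component list."""
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"malformed component list: {spec!r}")
    prefix, ranges = match["prefix"], match["ranges"]
    if ranges is None:
        names = [prefix]                      # a single, fully spelled name
    else:
        names = []
        for part in ranges.split(","):
            low, _, high = part.partition("-")
            first = int(low)
            last = int(high) if high else first
            names.extend(f"{prefix}{n}" for n in range(first, last + 1))
    return names, match["type"], match["category"]
```

For instance, expand("R-[1-3]#io@rack") yields the names R-1, R-2 and R-3 with type io and category rack, while a name without brackets such as esw-1#eth@switch is returned unchanged.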


3.1.2 Sequencer Dependency Rules: the se- 
quencer table 


The Dependency Graph Maker requires dependency 
rules to be specified in a database table. This table 
describes multiple sets of dependency rules. A ruleset 
is defined as a set of dependency rules. For example, 
there is one ruleset called smartstart containing all 
the dependency rules required for the starting of com- 
ponents. Another ruleset containing all dependency 
rules required for the stopping of components would 
be called smartstop. 

⁶ It is considered a good practice to have a database where 
the cluster is described. In a bullx cluster, each component 
is known and various pieces of information are linked to it 
such as its model, its status, its location and so on. There 
should be a way to find a type from a component name. In 
this article, we use a database for that purpose, although any 
other means can be used. 

⁷ Cold doors are spelled 'coldoor' in bullx cluster database. 

The format of this table is presented below. One 
line in the table represents exactly one dependency 
rule. Table columns are: 


ruleset: the name of the ruleset this dependency 
rule is a member of. 


name: the rule name; this name is used as a reference 
in the dependson column and should be unique 
in the ruleset; 


types: the component types the rule should be ap- 
plied to. A type is specified using the full name 
(that is, 'type@category'). Multiple types 
should be separated by the "pipe" symbol as 
in compute@node|io@node. The special string 
'ALL' acts as a wildcard: 'ALL@node' means 
any component from table node matches, while 
'ALL@ALL' means any component matches, and 
is equivalent to 'ALL' alone. 


filter: an expression of one of the following two forms: 


• %var =~ regexp 


• %var !~ regexp 


where '%var' is a variable that will be replaced 
by its value on execution (see table 1 for the 
list of available variables). The operator '=~' 
means that the component will be filtered in only if 
a match occurs, while '!~' means the component 
will be filtered in only if a match does not oc- 
cur (in other words, if a match occurs, it will be 
filtered out). 


If the expression does not start with a known 
'%var', the expression is interpreted as a 
(shell) command that, when called, specifies whether the 
given component should be filtered in (return 
code is 0) or out (return code is different from 
0). Variables will also be replaced before com- 
mand execution, if specified. As an example, to 
filter in only the components whose name starts with 
the string 'bullx104', one would use: '%name 
=~ ^bullx104'. On the other hand, to let a 
script decide on the component id, one would 
use: '/usr/bin/my_filter id'. 


Finally, two special values are reserved for spe- 
cial meanings here: 


• String 'ALL': any component is filtered in 
(i.e. accepted); 


• The 'NULL' special DB value: any compo- 
nent is filtered out (i.e. refused). 


action: the (shell) command that should be exe- 
cuted for each component that matches the rule 
type (and that has been filtered in). Variables 
will be replaced, if specified (see table 1 for the 
list of available variables). If the action is pre- 
fixed with the '@' symbol, the given action will 
be executed on the component using an internal 
'ssh' connection. Depending on the action exit 
code, the Instructions Sequence Executor may 
continue its execution, or abort. This will be 
discussed in section 3.3. 


depsfinder: the (shell) command that specifies 
which components the current component de- 
pends on. The command should return the com- 
ponents set on its standard output, one compo- 
nent per line. A component should be of the 
following format: ’name#type@category’. Vari- 
ables will be replaced, if specified (see table 1 for 
the list of available variables). When set to the 
’NULL’ special DB value, rule names specified in 
the dependson column are simply ignored. 


dependson: a comma-separated list of rule names 
this rule depends on. For each dependency re- 
turned by the depsfinder, the sequencer checks 
whether the dependency type matches one of the rule 
types specified by this field (rule names specified 
should be in the same ruleset). If such a match 
occurs, the rule is applied on the dependency. 
When set to the 'NULL' special DB value, the 
script specified in the 'depsfinder' column is 
simply ignored. 


comments: a free form comment. 
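
The filter column semantics described above can be summarised in a short Python sketch. It is only an approximation written for illustration: the function names, the use of Python's re module as the regular expression dialect, and the representation of the 'NULL' DB value as None are assumptions, not details taken from the sequencer.

```python
import re
import subprocess

_FILTER = re.compile(r"^%(\w+)\s*(=~|!~)\s*(.+)$")


def substitute(text, variables):
    """Replace each %name occurrence by its value (longest names first)."""
    for name in sorted(variables, key=len, reverse=True):
        text = text.replace("%" + name, variables[name])
    return text


def filtered_in(expression, variables):
    """Return True when the component described by variables is kept."""
    if expression is None:          # 'NULL' DB value: always filtered out
        return False
    if expression == "ALL":         # always filtered in
        return True
    match = _FILTER.match(expression)
    if match and match.group(1) in variables:
        value = variables[match.group(1)]
        found = re.search(match.group(3), value) is not None
        return found if match.group(2) == "=~" else not found
    # Otherwise the expression is a shell command: exit code 0 keeps
    # the component, any other exit code filters it out.
    command = substitute(expression, variables)
    return subprocess.run(command, shell=True).returncode == 0
```

With variables = {"name": "bullx1042"}, the expression '%name =~ ^bullx104' keeps the component and '%name !~ ^bullx104' rejects it.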
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%category The category of the component 
The current ruleset being processed 
The current rule being processed 





Table 1: List of available variables. 


The framework does not allow the specification of a
timeout for a given action for two main reasons:


1. Granularity: if such a specification were provided
at that level (in the sequencer table), the timeout
would apply to every component of the types
specified by the ’type’ column, whereas a finer,
per-component granularity seems preferable. This is
easily achieved by the action script itself, to
which the component can be given as a parameter.


2. The action script, for a given component, knows
what to do when a timeout occurs much better
than the sequencer itself. Therefore, if specific
processing is required after a command timeout
(such as a retry), the action script should itself
implement the required behavior when the timeout
occurs and return the appropriate return code.


As an example we will consider the sequencer table
presented in table 2.


Table 2: An example of a sequencer table.


3.1.3 Algorithm


The objective of the Dependency Graph Maker is to
output the dependency graph based on the dependency
rules defined in the related table and on the
components list given as a parameter. The computing
of the dependency graph involves the following
steps:


1. Components List Expansion: the given components
list is expanded. This returns a list of names of
the form ’name#type@category’. Such a name is
called an id in the following.


2. Dependency Graph Creation: the dependency graph
is created as a set of disconnected nodes where
each node is taken from the list of ids. A node in
the dependency graph has the following form:
’id [actionsList]’ where ’actionsList’ is the list
of actions that should be executed for the
component with the corresponding ’id’. This graph
is updated during the process by:

(a) Node additions: when processing a given
component ’c’ through a dependency rule (one row
in the related ruleset table), the command line
specified by column ’depsfinder’ is executed. This
execution may return a list of components that
should be processed before the current one. Each
of those components is therefore added to the
dependency graph if it is not already present.

(b) Arc additions: for each component returned by
the ’depsfinder’ script, an arc is added between
’c’ and that returned component;

(c) Node modification: when processing a given
component, the content of the column ’action’ of
the ruleset table of the related dependency rule
is added to the node actions list.


3. Rules Graph Making: from the dependency rules
table, and for a given ruleset, the corresponding
graph, called the rules graph, is created. A node
in that graph is a pair (s,t) where s is the rule
symbolic name, and t is the set of component types
defined by the ’types’ column in the dependency
rule table. This graph is used for the selection
of a component in the components list to start the
dependency graph update process. From table 2,
the rules graph of the stop ruleset is shown in
figure 3.1. Note that cycles are possible in this
graph. As an example, a PDU (related to a switch
type in the sequencer table) may power an Ethernet
switch which itself provides the network connection
of a PDU.


Figure 3.1: Rules Graph of the ’stop’ ruleset defined
in the sequencer table 2.


4. Updating the Dependency Graph: for each id
(resulting from the expansion of the initial
components list), the corresponding rule in the
given ruleset of the sequencer table should be
found. For that purpose, a potential root is looked
for in the ids set. A potential root is an id that
matches one root⁸ in the rules graph. A match
between an id of the form ’name#type@category’ and
a rule ’R’ occurs when type is in ’R.types’ and
when id has been filtered in. If such a match
cannot be found, then a new rules graph is derived
from the preceding one by removing all roots and
their related edges. Then, the root finding is done
on that new graph, and so on recursively until
either:


• the rules graph is empty: in this case, the given
components list cannot be started/stopped
(entirely);


• the rules graph is only made of cycles: any id
can be used as the starting point;


• a match occurs between ’id’ and ’R’: in this
case, the dependency graph is updated from the
application of rule ’R’ to ’id’. Each time such an
application is made, ’id’ is removed from the
initial id set.


⁸ A node in the graph with no parent.
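The root search of step 4 can be sketched in Python as follows. This is an illustrative reading of the description, not the sequencer's implementation: the function names are hypothetical, filtering is ignored (every id is assumed to be filtered in), and the STOP_TYPES and STOP_DEPENDSON mappings are assumptions reconstructed from the worked example rather than copied from table 2.

```python
from typing import Dict, List, Optional, Set, Tuple


def find_roots(graph: Dict[str, Set[str]]) -> Set[str]:
    """Rules that no other rule lists in its 'dependson' set (no parent)."""
    children = set().union(*graph.values())
    return {rule for rule in graph if rule not in children}


def find_start(
    ids: List[str],
    rule_types: Dict[str, Set[str]],
    dependson: Dict[str, Set[str]],
) -> Optional[Tuple[str, Optional[str]]]:
    """Return (id, rule) for the first id matching a root of the rules graph.

    Roots are peeled off level by level when nothing matches. Returns None
    when the graph becomes empty, and (first id, None) when only cycles remain.
    Ids are assumed to be well formed ('name#type@category').
    """
    if not ids:
        return None
    graph = {rule: set(deps) for rule, deps in dependson.items()}
    while graph:
        roots = find_roots(graph)
        if not roots:
            # Only cycles are left: any id can be the starting point.
            return ids[0], None
        for cid in ids:
            ctype = cid.split("#", 1)[1].split("@", 1)[0]
            for rule in sorted(roots):
                if ctype in rule_types[rule]:
                    return cid, rule
        # No match: derive a new graph without the roots and their edges.
        graph = {r: deps for r, deps in graph.items() if r not in roots}
    return None


# Assumed reconstruction of the 'stop' ruleset used in the example.
STOP_TYPES = {
    "coldoorOff": {"coldoor"},
    "nodeOff": {"compute", "nfs"},
    "nfsDown": {"nfsd"},
    "unmountNFS": {"unmountNFS"},
}
STOP_DEPENDSON = {
    "coldoorOff": {"nodeOff"},
    "nodeOff": {"nfsDown"},
    "nfsDown": {"unmountNFS"},
    "unmountNFS": set(),
}
```

With these assumed tables, the id set of the example first yields the cold door with rule ’coldoorOff’, and a set reduced to ’nfs2#nfs@node’ yields ’nodeOff’ once the first-level root has been removed.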


As an example, consider a rack cooled by a bullx cold
door ’cd0’ containing an NFS server ’nfs1’ and
a compute node ’c1’. Consider also another NFS
server ’nfs2’ that is not in the same rack. We also
suppose that:


• ’c1’ is a client of both ’nfs1’ and ’nfs2’;
• ’nfs1’ is a client of ’nfs2’;
• ’nfs2’ is a client of ’nfs1’⁹.


Using table 2, and the components list
’nfs1#nfsd@soft, cd0, nfs2’, the objectives are:


• power off ’c1’ and ’nfs1’ before ’cd0’, because
powering off a cold door requires that every piece
of equipment it cools is powered off first;


• stop the NFS daemons on ’nfs1’ because it is
requested (this should be done before powering off
’nfs1’);

• power off ’nfs2’ because it is requested (but its
NFS daemon will have to be stopped first);


• for each NFS client¹⁰, a warning should be
written before the NFS server it uses is actually
stopped.


⁹ Yes, it might seem strange here. This serves the
purpose of our example.

¹⁰ One might use the content of /var/lib/nfs/rmtab
for an (inaccurate) list of NFS clients, the
’showmount’ command or any other means.
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nfs1#nfsd@soft| |cd0#coldoor@hwmanager nfs2#nfs@node| 


Figure 3.2: The initial dependency graph. 


Figure 3.3: The dependency graph after the call to
the ’cd0#coldoor@hwmanager’ depsfinder.


With the table, the ruleset and the components
list, the sequencer starts by expanding the
components list into an id set: ’nfs1#nfsd@soft,
cd0#coldoor@hwmanager, nfs2#nfs@node’. Then
the dependency graph is initialized: each id in the
set has a related node in the graph as shown in
figure 3.2.

Then the sequencer looks for a potential root
using the rules graph (shown in figure 3.1).
A match exists between rule ’coldoorOff’ and
’cd0#coldoor@hwmanager’. Therefore, the sequencer
starts applying rule ’coldoorOff’ to
’cd0#coldoor@hwmanager’. The depsfinder of the
rule is called. For each id returned by the
depsfinder¹¹, the graph is updated: a node with an
empty action list is made and an edge from the
current id to the dependency is created. Each
returned id is added to the id set.

Then, for each dependency, the sequencer checks
if a match exists with one of the rules defined in the
’dependson’ column, and for each match, the matching
rule is applied recursively.

In our case, the cold door depsfinder returns
every cooled component: ’nfs1#nfs@node’ and
’c1#compute@node’. Therefore, the graph is updated
as shown in figure 3.3. The ’coldoorOff’ rule
defines a single dependency in its ’dependson’
column: ’nodeOff’. Both components match, so the
rule is applied. The application of the rule
’nodeOff’ on ’c1#compute@node’ leads to the
execution of the depsfinder, which does not return
anything. Therefore, the application of the rule
ends by adding the action ’nodectrl poweroff name’
to the related node in the dependency graph and by
the removal of the related id from the id set.

This implies that a given rule is applied at most
once on a given id.


¹¹ Note that it is not required that the depsfinder
returns ids with a predefined category, as long as
a match occurs in the sequencer table. Predefined
categories are used to ease the mapping between a
given component and a type. In a large cluster (or
data-center), it may not be easy to determine what
the real type of a given component name is.


Figure 3.4: The dependency graph after the
application of rule ’coldoorOff’ on
’cd0#coldoor@hwmanager’.

The sequencer continues with the next dependency,
which is ’nfs1#nfs@node’. The application of the
rule ’nodeOff’ leads to the execution of the
depsfinder, which returns ’nfs1#nfsd@soft’. This
node is already in the graph (it is in the initial
id set). Therefore, the dependency graph is just
updated with a new edge. This id matches the
specified dependency rule ’nfsDown’, and this last
rule is applied to that id. The depsfinder on
’nfs1#nfsd@soft’ returns all known clients, which
are ’c1#unmountNFS@soft’ and ’nfs2#unmountNFS@soft’.

Finally both dependencies match the rule
’umountNFS’ but its application does not lead to
any new node in the dependency graph. However,
the graph is updated so each node is mapped to
its related action, recursively, up to the node the
sequencer started with: ’cd0#coldoor@hwmanager’,
as shown in figure 3.4.

At that stage, the id set contains only the
last element from the original components list:
’nfs2#nfs@node’ (others were removed by preceding
rule applications). Unfortunately that id does
not match any root rule in the rules graph. Thus,
the rules graph is (virtually) modified so roots are
removed. This leaves us with the rules graph shown
in figure 3.5.


Figure 3.5: The rules graph with first level roots
removed.

From that graph, id ’nfs2#nfs@node’ matches
root rule ’nodeOff’ which is therefore applied.
The depsfinder returns ’nfs2#nfsd@soft’
which is new and therefore added to the
dependency graph. The rule ’nfsDown’ is applied
to that id (since a match occurs) giving us
two dependencies: ’c1#unmountNFS@soft’ and
’nfs1#unmountNFS@soft’.

The algorithm ends after the mapping of those new
ids to their related actions, as shown in the final
graph (figure 3.6).

Remember that a rule is never applied twice
on a given id. Therefore, the action from
rule ’unmountNFS’ which is ’echo WARNING: NFS
mounted!’ on id ’c1#unmountNFS@soft’ is not
added twice.

The sequencer displays this dependency graph
in an XML format (using the open-source
python-graph library available at
http://code.google.com/p/python-graph/) on its
standard output. This output can be given directly
to the second stage of the sequencer. Note that
contrary to the rules graph, the dependency graph
should not contain a cycle (this will be detected
by the next stage and refused as an input).


Figure 3.6: The final dependency graph.


3.2 Instructions Sequence Maker (ISM)


The Instructions Sequence Maker (ISM) is the second 
stage of the sequencer. Its role is to transform a de- 
pendency graph into a set of instructions that can be 
given as an input to the third stage, the Instructions 
Sequence Executor (ISE). 

A set of instructions is specified as an XML
document, within an <instructions> XML tag. Three
kinds of instructions can be specified:


Action: defined by the <action> tag. It specifies
the actual command that should be executed.
Attributes are:


• ’id’: each action should be identified by a
unique string. This attribute is mandatory. It is
usually¹² of the form ’name#type@category!rule’.


• ’deps’: a list of ids this action depends on
(explicit dependencies). This attribute is
optional. Default is the empty string.


• ’remote’: the command is executed using the
current shell unless this attribute is set to
’true’. In this case, an internal ssh connection is
made to execute the given command on each component
described by the ’component_set’ attribute (see
below). This attribute is optional. Default is
’false’.


• ’component_set’: the set of components this
action should be executed on, in the following
format: ’name[range]#type@category’. This
attribute is ignored by the ISE unless the remote
attribute is set to ’true’. This attribute is
optional. Default is ’localhost#type@cat’.


¹² It is not required for the instructions sequence
XML document to be created by the Instructions
Sequence Maker. It may be created/modified by hand
or by any other program.


Sequence: defined by the <seq> tag. It specifies
a set of instructions (hence, one of Action,
Sequence or Parallel) that must be executed in the
given order. This defines implicit dependencies
between instructions as opposed to explicit
dependencies defined by the ’deps’ attribute of
an Action.


Parallel: defined by the <par> tag. It specifies a set
of instructions (hence, one of Action, Sequence or
Parallel) that can be executed in any order. This
explicitly defines that there is no dependency
between the instructions. The ISE is free to
execute them in parallel, but it may or may not do
so: parallel execution is not a requirement for
the successful completion of a parallel instruction.
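To make the difference between implicit and explicit dependencies concrete, the following Python sketch computes, for each action of an instructions document, the set of action ids it has to wait for. It is only one possible reading of the format described above. Two points are assumptions, not statements from the paper: the ’deps’ attribute is taken to be a comma-separated list, and the direct children of <instructions> are treated as independent of each other.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Set


def action_dependencies(xml_text: str) -> Dict[str, Set[str]]:
    """Map each action id to the ids it directly waits for."""
    root = ET.fromstring(xml_text)
    if root.tag != "instructions":
        raise ValueError("expected an <instructions> document")
    deps: Dict[str, Set[str]] = {}

    def walk(elem: ET.Element, before: Set[str]) -> Set[str]:
        """Record dependencies under elem; return the ids that complete it."""
        if elem.tag == "action":
            action_id = elem.get("id")
            if action_id is None:
                raise ValueError("the 'id' attribute is mandatory")
            # Assumption: 'deps' is a comma-separated list of ids.
            explicit = {d.strip() for d in elem.get("deps", "").split(",")
                        if d.strip()}
            deps[action_id] = set(before) | explicit
            return {action_id}
        if elem.tag == "seq":
            # Each instruction implicitly waits for the previous one.
            finished = set(before)
            for child in elem:
                finished = walk(child, finished)
            return finished
        if elem.tag == "par":
            # Children are independent; the block ends when all have ended.
            ended: Set[str] = set()
            for child in elem:
                ended |= walk(child, before)
            return ended or set(before)
        raise ValueError(f"unknown instruction: <{elem.tag}>")

    for child in root:
        walk(child, set())
    return deps
```

For a sequence made of an action a, a parallel block holding b and c, and a final action d, the function reports that b and c wait for a, and that d waits for both b and c.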


Transforming a dependency graph into an instruc- 
tions sequence is straightforward if performance is 
not the main goal. A simple topological sort [6] on 
the input dependency graph returns a sequence of 
actions where constraints are respected. 
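As an illustration of this simple transformation, the sketch below applies Python's standard graphlib module (Python 3.9 or later) to a dependency graph. The DEPENDS_ON mapping is an assumption reconstructed from the walkthrough of section 3.1.3, since the figure itself is not reproduced here, and the sequencer's own implementation may differ.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

# Assumed reconstruction of the final dependency graph:
# each id maps to the ids that must be processed before it.
DEPENDS_ON: Dict[str, Set[str]] = {
    "cd0#coldoor@hwmanager": {"nfs1#nfs@node", "c1#compute@node"},
    "nfs1#nfs@node": {"nfs1#nfsd@soft"},
    "nfs2#nfs@node": {"nfs2#nfsd@soft"},
    "nfs1#nfsd@soft": {"c1#unmountNFS@soft", "nfs2#unmountNFS@soft"},
    "nfs2#nfsd@soft": {"c1#unmountNFS@soft", "nfs1#unmountNFS@soft"},
}


def sequential_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return one valid order in which every id follows its dependencies.

    Raises graphlib.CycleError if the graph contains a cycle.
    """
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, node in enumerate(sequential_order(DEPENDS_ON), start=1):
        print(step, node)
```

Several valid orders exist for the same graph, so the exact sequence printed is not significant; only the relative order of each id and its dependencies is guaranteed.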

For instance, in our example, where the final
dependency graph computed by the DGM is given in
figure 3.6, a topological sort¹³ gives the sequence
shown in sample 1.


¹³ For a given directed acyclic graph, several valid
topological sort outputs can be found.


This sequence is valid, but not efficient: it requires
9 sequential steps. This transformation algorithm is
called ’seq’ in the sequencer and it can be selected.
Three other algorithms are provided within the
sequencer:


• ’par’: this algorithm inserts each node of the
dependency graph into a single parallel (<par>
XML tag) instruction with explicit dependencies
(’deps’ attribute of the <action> XML tag).
Such an algorithm is optimal in terms of
performance, but it produces an instructions
sequence file that is difficult for a human to read
because of all those explicit dependencies.


• ’mixed’: this algorithm inserts the leaf nodes
of the dependency graph into a parallel
instruction, then removes those leaf nodes from the
dependency graph and starts again. All such
parallel instructions are included in one global
sequence instruction (<seq> XML tag). This
algorithm tends to execute sets of actions step by
step: all leaf nodes are executed in parallel. Once
they have terminated, they are removed from the
graph, and another batch of leaf nodes is executed
in parallel, and so on up to the end.


• ’optimal’: this algorithm produces an
instructions sequence that is as efficient as the
’par’ algorithm but much more readable. It uses
implicit dependencies as much as possible¹⁴ using
sequence instructions. This algorithm is selected
by default.


Describing those algorithms in detail, with their
advantages and constraints, is beyond the scope of
this paper.
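Although the paper does not detail these algorithms, the ’mixed’ strategy as summarized above can be approximated by the following Python sketch. It is a hypothetical illustration (the mixed_batches name and the dictionary representation are assumptions): each returned batch would become one <par> instruction, and the batches together one global <seq>.

```python
from typing import Dict, List, Set


def mixed_batches(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group nodes into batches of leaves, peeled off one level at a time.

    depends_on maps each node to the nodes it depends on. A leaf is a node
    whose dependencies have all been emitted in an earlier batch.
    """
    remaining = {node: set(deps) for node, deps in depends_on.items()}
    # Nodes that only appear as dependencies are leaves from the start.
    for deps in depends_on.values():
        for dep in deps:
            remaining.setdefault(dep, set())

    batches: List[List[str]] = []
    while remaining:
        leaves = sorted(n for n, deps in remaining.items() if not deps)
        if not leaves:
            raise ValueError("the dependency graph contains a cycle")
        batches.append(leaves)
        for leaf in leaves:
            del remaining[leaf]
        for deps in remaining.values():
            deps.difference_update(leaves)
    return batches


if __name__ == "__main__":
    example = {"d": {"b", "c"}, "b": {"a"}, "c": {"a"}}
    print(mixed_batches(example))
```

In this toy graph, b and c end up in the same batch and can run in parallel, so three steps are needed instead of the four a purely sequential order would take.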


¹⁴ Our XML instructions sequence format can only
express trees if implicit dependencies are used
exclusively.


Sample 1: Result of the topological sort on the dependency graph given in figure 3.6.

<instructions>
  <seq>
    <action id="c1#unmountNFS@soft!unmountNFS">echo WARNING: NFS mounted!</action>
    <action id="nfs1#unmountNFS@soft!unmountNFS">echo WARNING: NFS mounted!</action>
    <action id="nfs2#unmountNFS@soft!unmountNFS">echo WARNING: NFS mounted!</action>
    <action id="nfs1#nfsd@node!nfsDown" remote="true" component_set="nfs1#nfsd@node">
      /etc/init.d/nfsd stop
    </action>
    <action id="nfs1#nfs@node!nodeOff">nodectrl poweroff nfs1</action>
    <action id="nfs2#nfsd@soft!nfsDown" remote="true" component_set="nfs2#nfs@node">
      /etc/init.d/nfsd stop
    </action>
    <action id="c1#compute@node">nodectrl poweroff c1</action>
    <action id="nfs2#nfs@node!nodeOff">nodectrl poweroff nfs2</action>
    <action id="cd0#coldoor@hwmanager!coldoorOff">bsm_power -a off cd0</action>
  </seq>
</instructions>


3.3 Instructions Sequence Executor (ISE)


The Instructions Sequence Executor (ISE) is the last
stage of the sequencer. It takes as input an
instructions sequence as computed by the ISM or
created/edited by hand or by any other means. It then
runs the specified instructions, taking into account:


• parallelism: actions that do not have dependencies
between them might be executed in parallel. There
is a customizable maximum limit on the number of
actions that can be executed in parallel by the
ISE. This helps limit the load increase of the
system due to a vast number of forks in a large
cluster.


• dependencies: an action is not executed unless
all its dependencies (explicit and implicit) have
completed successfully. An executed action is
considered successful in two cases:

  - its returned code is 0 (alias OK);

  - its returned code is 75 (alias WARNING, also
    known as EX_TEMPFAIL in sysexits.h) and the
    option ’--Force’ has been given to the ISE.
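The success criterion above is small enough to be written down directly. The Python sketch below is an illustration only; the is_successful and can_run names, and the dictionaries used to hold dependencies and return codes, are assumptions and not part of the ISE.

```python
from typing import Dict, Iterable, Mapping

EX_OK = 0         # alias OK
EX_TEMPFAIL = 75  # alias WARNING, same value as EX_TEMPFAIL in sysexits.h


def is_successful(returncode: int, force: bool = False) -> bool:
    """An action succeeded if it returned 0, or 75 when forcing is enabled."""
    return returncode == EX_OK or (force and returncode == EX_TEMPFAIL)


def can_run(
    action: str,
    deps: Mapping[str, Iterable[str]],
    results: Dict[str, int],
    force: bool = False,
) -> bool:
    """True when every dependency of the action has completed successfully.

    results maps the id of each already executed action to its return code;
    a dependency missing from results has not been executed yet.
    """
    return all(
        dep in results and is_successful(results[dep], force)
        for dep in deps.get(action, ())
    )
```

For example, with a dependency that returned 75, can_run is false by default and becomes true only when force is set, which mirrors the effect described for the ’--Force’ option.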


The implementation of the ISE uses the Cluster- 
Shell [7] python library as the backend execution en- 
gine. Describing the implementation of the ISE is 
beyond the scope of this article. 


4 Scalability Issues 


Using the sequencer on a large cluster such as the
Tera-100 can lead to several issues related to
scalability.


4.1 Complexity 


Several complexities in space and time can be
identified, for:


1. the production of the dependency graph by the
DGM;


2. the production of the actions graph by the ISM;


3. the execution of the actions graph by the ISE.


This last complexity was our first concern due to our
customer requirements. Although the theoretical
complexity has not (yet) been formally determined,
the measured execution time of the sequencer for the
production of the dependency graph of the Tera-100
is:


• 13 minutes 40 seconds for the start ruleset with
9216 nodes and 8941 edges in the dependency
graph;


• 2 minutes 1 second for the stop ruleset with 9222
nodes and 13304 edges in the dependency graph.


The time taken by the ISM for the production of the
actions graph from the previously computed
dependency graph using the ’optimal’ algorithm is:


• 4.998 seconds for the start ruleset with 4604
nodes and 8742 edges in the actions graph;


• 6.343 seconds for the stop ruleset with 4606
nodes and 9054 edges in the actions graph.


Finally, the time taken by the ISE to execute these
actions graphs is:


• 4 minutes 27 seconds for the start ruleset with:

  - 99.7% of actions executed (successfully or
    not);
  - 6.6% of actions that ended in error for various
    reasons;
  - 0.3% of actions not executed because some of
    their dependencies ended in error or were not
    executed.


• 9 minutes 23 seconds for the stop ruleset with:

  - 96.7% of actions executed (successfully or
    not);
  - 15.3% of actions that ended in error for various
    reasons;
  - 3.3% of actions not executed because some of
    their dependencies ended in error or were not
    executed.


Explaining the differences between those metrics is
beyond the scope of this paper. However, from such
results, the sequencer can be considered as quite
scalable.


4.2 Maintainability


The maintenance of the various graphs used by the
sequencer:


• the rules graph;
• the DGM produced dependency graph;
• the ISM produced actions graph XML file;


is also an issue on large systems. Identifying wrong
dependencies in a flat file can be hard, especially
with large graphs represented by several thousands
of lines.

The sequencer can export those graphs in the DOT
format. It therefore delegates the identification of
non-trivial problems, for maintenance purposes, to
specific graph tools. For instance, rules graphs are
usually small and the standard ’dot’ command that
comes with the open-source Graphviz [8] product
can be used for their visualization. This is fast and
easy. For other, much larger graphs, however,
specialized tools such as Tulip [9] might be used
instead.
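As an illustration of what such an export might look like, the Python sketch below writes a dependency mapping as DOT text. It is not the sequencer's exporter: the to_dot function and its dictionary input are assumptions made for this example. Ids are quoted because they contain the ’#’ and ’@’ characters.

```python
from typing import Dict, Set


def to_dot(depends_on: Dict[str, Set[str]], name: str = "dependencies") -> str:
    """Render a dependency mapping as a DOT digraph.

    An edge goes from each id to every id it depends on.
    """
    def quote(node: str) -> str:
        # DOT identifiers with special characters must be double-quoted.
        return '"' + node.replace('"', '\\"') + '"'

    nodes = set(depends_on)
    for deps in depends_on.values():
        nodes.update(deps)

    lines = [f"digraph {name} {{"]
    for node in sorted(nodes):
        lines.append(f"    {quote(node)};")
    for source in sorted(depends_on):
        for target in sorted(depends_on[source]):
            lines.append(f"    {quote(source)} -> {quote(target)};")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_dot({"cd0#coldoor@hwmanager": {"nfs1#nfs@node", "c1#compute@node"}}))
```

Saved to a file such as graph.dot (a hypothetical name), the result can be rendered with the Graphviz command ’dot -Tpng graph.dot -o graph.png’.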





4.3 Usability 


In the context of large systems, giving correct inputs 
to a tool, and getting back a usable output can be a 
big challenge in itself. In the case of the sequencer, 
inputs are the sequencer table and the components 
list. 

For the maintenance of the table, the sequencer
provides a management tool that helps with adding,
removing, updating, copying and even checksumming
rules. For the components list, the sequencer uses
what is called a guesser that, given a simple
component name, fetches its type and category. This
allows the end user to specify only the name of a
component.

Apart from the outputs produced by the first two
stages, which have already been discussed in the
previous section on maintainability, the last output
of great interest for the end user is the ISE output.
To further increase the usability of the sequencer,
several features are provided:








Prefix Notation: each executed action output is
prefixed by its id. When the ISE executes an
actions graph produced by the previous stages, those
ids contain various pieces of information such as
the type, the category, and the rule name this
action comes from. This helps identify which action
produced which output (the bare minimum). Moreover,
such an output can be passed to various filters such
as grep or even gathering commands such as ’clubak’
from ClusterShell [7] or ’dshbak’ from pdsh [10]. In
the case of the ’clubak’ command, the separator can
be given as an option. As a side effect, this allows
the end user to group similar output by node, type
or category.
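The grouping made possible by the prefix notation can be sketched in a few lines of Python. The exact layout of the ISE output is not given in the text, so this example assumes that each line has the form ’id: output’ with ids of the form ’name#type@category!rule’; the group_output function and the ’: ’ separator are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

# 'name#type@category' with an optional '!rule' suffix.
ID_RE = re.compile(
    r"^(?P<name>[^#]+)#(?P<type>[^@]+)@(?P<category>[^!\s]+)"
    r"(?:!(?P<rule>\S+))?$"
)


def group_output(
    lines: Iterable[str], key: str = "name", sep: str = ": "
) -> Dict[Optional[str], List[str]]:
    """Group output lines by one part of their id prefix.

    key is one of 'name', 'type', 'category' or 'rule'. Lines without a
    recognizable id prefix are skipped.
    """
    groups: Dict[Optional[str], List[str]] = defaultdict(list)
    for line in lines:
        prefix, found, text = line.partition(sep)
        match = ID_RE.match(prefix)
        if not found or match is None:
            continue
        groups[match.group(key)].append(text)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "c1#unmountNFS@soft!unmountNFS: WARNING: NFS mounted!",
        "nfs2#nfs@node!nodeOff: powered off",
    ]
    print(group_output(sample, key="category"))
```

Tools such as ’clubak’ or ’dshbak’ provide this kind of gathering out of the box; the sketch only shows why ids that carry the type, category and rule name make it straightforward.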


Reporting: the ISE can produce various reports:


• ’model’: each action is shown with its
dependencies (implicit and explicit); this is used
to determine what the ISE will do before the actual
execution (using a ’--noexec’ option);


• ’exec’: each executed action is displayed along
with various timing information and the returned
code;


• ’error’: each executed action that exited with an
error code is displayed along with its reverse
dependencies (its parents in the dependency graph);
this is used to know which actions have not been
executed because of a given error;


• ’unexec’: each non-executed action is displayed
along with its missing dependencies (a dependency
that exited with an error code and that prevented
the action from being executed); this is used to
know why a given action has not been executed.


5 Results 


The sequencer has been designed for two main pur- 
poses: 


1. Emergency Power Off: this is the reason for the
three different independent stages;


2. Common management of resources in clusters
(and data-centers): this is the main reason for
the chaining mechanism.


Our first experiment with our tool shows that it is 
quite efficient. Our main target was the powering 
on/off of the whole Tera-100 system which leads to 
the execution of more than 4500 actions in less than 
5 minutes for the start and in less than 10 minutes 
for the stop. 
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6 Related Works 


Dependency graph makers exist in various products:


• Make [11], SCons [12] and Ant [13], for example,
are used for building software; they focus on files
rather than cluster components and are therefore
not easily adaptable to our problem.


• Old System V init [14], BSD init [15], Gentoo
init [16] and their successors Solaris SMF [17],
Apple launchd [18], Ubuntu upstart [19] and Fedora
systemd [20] are used during the boot of a system
and for managing daemons. To our knowledge none of
those products can be given a components list as
an input so that actions are executed based on it.


Solutions for starting/stopping a whole cluster are
most of the time manual, described in a step-by-step
chapter of the product documentation, and hard
wired. This is the case for example with the
Sun/Oracle solution [21] (command ’cluster
shutdown’). It is not clear whether all components
in the cluster are taken into account (switches,
cooling components, software, ...) and whether new
components can be added to the shutdown process.
IBM uses the open-source xCAT [22] project and its
’rpower’ command, which does not handle dependencies
between cluster components.

From a high level perspective, the sequencer can
be seen as a command dispatching framework similar
to Fabric [23], Func [24] and Capistrano [25], for
example. But these products lack the ability to deal
with dependencies, making them unsuitable for
our initial problem.

The sequencer can also be seen as a workflow
management system where the pair DGM/ISM acts as
a workflow generator, and the ISE acts as a
workflow executor. However, the sequencer has not
been designed for human interactive actions. It does
not deal, for example, with user access rights or
task lists. It is therefore much lighter than
common user oriented workflow management systems
such as YAWL [26], Bonita [27], Intalio|BPMS [28],
jBPM [29] or Activiti [30] among others.

We finally found a single real product for which
a comparison has some meaning: ControlTier [31].
ControlTier shares various features with the
sequencer, such as possible parallel execution of
independent actions and failure handling. However,
the main difference is in the way workflows are
produced: they are dynamically computed in the
sequencer case through dependency rules (depsfinder
scripts) and the component input list, whereas they
are hard wired through configuration files in the
case of ControlTier.

To our knowledge, our solution is the first one to
address directly the problem of starting/stopping a
whole cluster or a part of it, taking dependencies
into consideration, while remaining integrated,
efficient, customizable and robust in the case of
failure. We suppose the main reason is that clusters
are not designed to be started/stopped entirely. Long
uptime is an objective! However, automated tools ease
the management of clusters, making start/stop
procedures faster and reproducible, and reducing
human errors to a minimum.


7 Conclusion and Future Works


The sequencer solution presented in this article is the
first of its kind to our knowledge. It has been
designed with EPO in mind. This is the reason for its 3
independent stages and for the incremental mode of
execution. Still, the sequencer provides the chaining
feature, making its use pertinent for small clusters or
for a small part of a big one.
The sequencer fulfills our initial objectives: 


• Predictive: the incremental mode allows a computed
instructions sequence to be verified, modified and
recorded before being run.


• Easy: executing a recorded instructions sequence
requires a single command:
’clmsequencer < instructions.sequence’


• Fast: the sequencer can execute independent
instructions in parallel with a customizable upper
limit.


• Smart: the order in which instructions are
executed comes from a dependency graph computed
from customizable dependency rules and a given
cluster components list.


• Robust: failures are taken into account by the
sequencer component by component.


Our solution is highly flexible in that most of its
inner workings are configurable, such as:


• the dependency rules,
• the dependency fetching scripts,
• the action to be taken on each component,
• the dependency graph,
• the final instruction sets.


The sequencer has been validated on the whole
Tera-100 system. A shutdown can be done in less than
10 minutes, and a power on takes less than 5 minutes
(more than 4500 actions for both rulesets).
The sequencer framework will be released under an
open-source license soon. Several enhancements are
planned for the end of this year including:


• smarter failure handling;
• live reporting/monitoring;
• performance improvement of dependency graph
generation through caching;
• post-mortem reporting;
• replaying.
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Abstract 


This paper describes a prototype implementation of a 
configuration system which uses automated planning 
techniques to compute workflows between declarative 
states. The resulting workflows are executed using the 
popular combination of ControlTier and Puppet. This al- 
lows the tool to be used in unattended “autonomic” situ- 
ations where manual workflow specification is not feasi- 
ble. It also ensures that critical operational constraints 
are maintained throughout the execution of the work- 
flow. We describe the background to the configuration 
and planning techniques, the architecture of the proto- 
type, and show how the system deals with several exam- 
ples of typical reconfiguration problems. 

Keywords: configuration management, infrastruc- 
ture, cloud computing, planning, IaaS 


1 Introduction 


The growing size and complexity of computing infras- 
tructures has increased awareness of the need for good 
system configuration tools, and most sites now use some 
form of tool to manage their configurations. Further- 
more, declarative specifications are now widely accepted 
as the most appropriate approach - the specification de- 
scribes the “desired” state of the system, and the tool 
computes the necessary actions to move the system from 
its current state into this desired state. This has the ad- 
vantage that the final state of the system is explicitly 
specified, and we can have some confidence that the state 
of the running system matches our requirements. Previ- 
ous approaches were more error-prone because they in- 
volved specifying the actions (for example, using imper- 
ative scripts), and the final outcome would not always 
be obvious. With varying degrees of strictness, most of
the currently popular tools take a broadly declarative
approach - for example, Puppet [16], Cfengine [3], BCFG
[4] and LCFG [1].


However, none of the above tools make any guaran- 
tees about the order of the changes involved in imple- 
menting a configuration change. When creating a new 
service, this is not normally an issue - we specify the re- 
quirements and the tool makes all the necessary changes 
(in some random order). When the tool has finished, we 
have a running system to our specification. However, if 
we are making configuration changes to an existing sys- 
tem, we may well care about the state of the configura- 
tion during the change; for example, if we want to make 
a transition from using one server, to using a different 
one, then we probably want to start the new server, and 
transfer the clients before shutting down the old one. 

Such transitions are often performed manually - the 
administrator will work out a number of intermediate 
stages (server B started, clients all using server B, server 
A stopped), and check that each state has been achieved 
before presenting the tool with the next state. However, 
this is time-consuming, error-prone, and not suitable
for unattended use - for example where we want to make 
a configuration change “autonomically” in response to 
some failure or change in load. 

One approach to this problem has been the use of man- 
ual workflow tools. These allow workflows such as the 
previous example to be captured and stored for automatic 
use - a particular workflow can then be invoked and the 
tool will take care of scheduling the separate stages in the 
given order. ControlTier [5] and IBM Tivoli Provision- 
ing Manager [12] are examples which provide this kind 
of capability. However, this still requires that the work- 
flows are computed manually. Even in a small system, a 
very large number of workflows can be required to cater 
for every eventuality - for example, moving from every 
possible failed state into a working state. And choosing 
an appropriate workflow to suit a particular goal state is 
not always obvious - indeed such manual workflows are 
conceptually similar to the imperative scripts which are 
no longer popular because of their unreliability. 

An alternative approach is to make use of automated 
planning technology to generate workflows “on the fly”.
This allows us to specify the current and goal states, to- 
gether with a set of constraints on the intermediate stages 
- for example, all clients must always point at a working 
server. The intermediate states are then computed auto- 
matically, and these can then be presented in order to the 
configuration tool to effect a smooth transition. 

This paper describes an experimental system which 
applies established AI planning tools to automatically 
generate workflows between declarative system states. 
The resulting intermediate states are implementable in 
Puppet and can be scheduled by ControlTier to produce 
a fully-automated system. We start (section 2) with a full 
“walk through” of a simple example, based on the server- 
transition problem described above. Section 3 then cov- 
ers the background in more detail, including system con- 
figuration and automated planning technology. Section 4 
describes the prototype system, section 5 presents some 
more complex examples, section 6 concludes with a dis- 
cussion of some of the problems, and section 7 presents 
possible future directions. 


2 An Example 


Assume we have a system consisting of two servers A 
and B, and one client C. Figure 1a shows the current
state: 


1. A.run = true (A is running), 
2. B.run = false (B is stopped), 
3. C.server = A (C is using a service of A). 


The administrator aims to change the configuration to 
the goal state shown in figure 1b, i.e.:


1. A.run = false (A is stopped), 
2. B.run = true (B is running), 
3. C.server = B (C is using a service of B). 


Since C depends on the server’s service, the changes 
must be implemented under a particular constraint, i.e.
C must always reference a running server. 








Figure 1: States of the system. (a) Current state: server A (run = true), server B (run = false), client C (reference = A). (b) Goal state: server A (run = false), server B (run = true), client C (reference = B).

If we use any declarative tool to implement these
changes, then there are six possible sequences of states
that could occur, i.e.:


1. A.run = false, C.server = B, B.run = true; 
2. C.server = B, A.run = false, B.run = true; 
3. B.run = true, A.run = false, C.server = B; 
4. A.run = false, B.run = true, C.server = B; 
5. C.server = B, B.run = true, A.run = false; 
6. B.run = true, C.server = B, A.run = false. 


Any of these sequences could appear in practice because 
the declarative tools implement the changes by executing 
the actions in an essentially indeterminate order. Unfortunately,
only one of these sequences (#6) satisfies the required
constraint. Hence, a declarative tool is highly likely to
produce a change sequence which leaves the system
inoperative for a period of time
during the change. 
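
The claim that only sequence #6 is safe can be checked mechanically. The following is a minimal sketch, not part of the prototype: it replays all six orderings of the three changes against the initial state and keeps those in which C references a running server after every step. The dictionary encoding of the state is an assumption made for illustration.

```python
from itertools import permutations

# Initial state from the example: A running, B stopped, C pointing at A.
INITIAL = {"A.run": True, "B.run": False, "C.server": "A"}

# The three individual changes a declarative tool has to make.
CHANGES = {
    "A.run = false": ("A.run", False),
    "B.run = true": ("B.run", True),
    "C.server = B": ("C.server", "B"),
}


def constraint_holds(state):
    """The constraint from the text: C must always reference a running server."""
    return state[state["C.server"] + ".run"]


def valid_orderings():
    """Return every ordering of the changes that never violates the constraint."""
    valid = []
    for order in permutations(CHANGES):
        state = dict(INITIAL)
        ok = True
        for name in order:
            key, value = CHANGES[name]
            state[key] = value
            if not constraint_holds(state):
                ok = False
                break
        if ok:
            valid.append(order)
    return valid


if __name__ == "__main__":
    for order in valid_orderings():
        print(", ".join(order))
```

Running the script prints a single ordering: B.run = true, C.server = B, A.run = false.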

To address the problem, the automated planning tech- 
nique used in our prototype creates the workflow auto- 
matically, based on the given goal states and the available 
actions. The prototype will generate a workflow which 
consists of a sequence of actions that satisfies an order- 
ing constraint. Each action has preconditions which are 
constraints that have to be satisfied before executing the 
action, and effects which are states that will be attained 
after executing the action. 

The prototype has the following actions pre-defined in 
the actions database: 


1. start-server 
parameters: <server> 
preconditions: <server>.run = false 
effects: <server>.run = true 


2. stop-server
parameters: <server> 
preconditions: 
<server>.run = true 
(forall <client> 
<client>.server != <server>) 
effects: <server>.run = false 


3. change-reference 
parameters: <server1> <server2> <client>
preconditions: 
<client>.server = <server1> 
<server2>.run = true 
<client>.server != <server2> 
effects: <client>.server = <server2> 


To generate the workflow, the administrator (or some 
autonomic system) simply needs to declare the goal 
states:


1. C.server = B
2. A.run = false


Based on the above goal states and the available ac- 
tions, our prototype generates the following ControlTier 
workflow: 


<command name="config_changes" 
command-type="WorkflowCommand" description="" 
is-static="true" error-handler-type="FAIL"> 
<workflow threadcount="1"> 
<command name="start-server_B"/> 
<command name="change-reference_A_B_C"/>
<command name="stop-server_A"/> 
</workflow> 
</command> 


<command name="start-server_B" description="" 
command-type="Command" is-static="true"> 
<execution-string>exec.rb</execution-string> 
<argument-string>start-server.pp B 
</argument-string> 
</command> 


<command name="change-reference_A_B_C" 
description="" command-type="Command"
is-static="true"> 
<execution-string>exec.rb</execution-string> 
<argument-string>change-reference.pp A B C
</argument-string> 
</command> 


<command name="stop-server_A" description=""
command-type="Command" is-static="true"> 
<execution-string>exec.rb</execution-string> 
<argument-string>stop-server.pp A 
</argument-string> 
</command> 


Submitting this workflow to ControlTier implements the 
configuration change using the valid sequence of actions 
(#6). 

This example shows that the prototype is able to elim- 
inate the sequencing problem that exists in declarative 
tools. Moreover, the prototype also simplifies the config- 
uration tasks since it only requires the administrator to 
declare the goal states, and not the explicit workflow. 
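
To make the planning step concrete, here is a minimal sketch of how a plan for this example could be found. It is not the prototype's planner (the prototype hands PDDL to an off-the-shelf planner); a plain breadth-first search stands in for it, and the Python encoding of the three actions and of the state is an assumption made for illustration.

```python
from collections import deque

SERVERS = ("A", "B")
CLIENTS = ("C",)


def start_server(state, s):
    # precondition: <server>.run = false
    if state[f"{s}.run"] is False:
        return {**state, f"{s}.run": True}
    return None


def stop_server(state, s):
    # preconditions: <server>.run = true and no client references <server>
    unused = all(state[f"{c}.server"] != s for c in CLIENTS)
    if state[f"{s}.run"] and unused:
        return {**state, f"{s}.run": False}
    return None


def change_reference(state, s1, s2, c):
    # preconditions: client uses s1, s2 is running, client does not use s2
    if s1 != s2 and state[f"{c}.server"] == s1 and state[f"{s2}.run"]:
        return {**state, f"{c}.server": s2}
    return None


def successors(state):
    """Yield (action name, next state) for every applicable ground action."""
    for s in SERVERS:
        for name, fn in (("start-server", start_server), ("stop-server", stop_server)):
            nxt = fn(state, s)
            if nxt is not None:
                yield f"{name}_{s}", nxt
    for s1 in SERVERS:
        for s2 in SERVERS:
            for c in CLIENTS:
                nxt = change_reference(state, s1, s2, c)
                if nxt is not None:
                    yield f"change-reference_{s1}_{s2}_{c}", nxt


def freeze(state):
    """Hashable form of a state, used to avoid revisiting states."""
    return tuple(sorted(state.items()))


def plan(initial, goal):
    """Breadth-first search; returns a shortest list of action names, or None."""
    queue = deque([(initial, [])])
    seen = {freeze(initial)}
    while queue:
        state, steps = queue.popleft()
        if all(state[k] == v for k, v in goal.items()):
            return steps
        for name, nxt in successors(state):
            if freeze(nxt) not in seen:
                seen.add(freeze(nxt))
                queue.append((nxt, steps + [name]))
    return None


if __name__ == "__main__":
    initial = {"A.run": True, "B.run": False, "C.server": "A"}
    print(plan(initial, {"C.server": "B", "A.run": False}))
```

Because the constraint is encoded in the precondition of stop-server, the search can only reach states in which C references a running server, so the plan it returns corresponds to sequence #6.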


3 Background 


This section summarises the approaches to system con- 
figuration discussed in the introduction, and surveys 
some background work on automated planning tech- 
niques. It concludes with a discussion of some related 
work in applying planning techniques to the configura- 
tion problem. 


3.1 System Configuration 


As noted in the introduction, approaches to system
configuration have mostly evolved via the following
stages:


• Manual configuration - the administrator manually
computes the actions necessary to change from
one configuration to another, and then manually ex- 
ecutes the commands necessary to implement this. 
Clearly, this is error prone, time-consuming, and it 
is difficult to prove reliably that a given sequence 
of changes will result in the required configuration 
under all circumstances. 


• Scripted changes - similar to the previous approach,
except that the sequence of changes is captured in an
imperative script, allowing it to be executed
multiple times, on different systems. This
clearly makes it easier to deal with large numbers 
of systems, and until comparatively recently, this 
was probably the most common approach to con- 
figuration for many people. However, scripting still 
suffers from most of the problems of the manual ap- 
proach. In particular, there is a tendency to blindly 
apply scripts to situations which do not meet the 
necessary preconditions, and the outcome can be 
very unpredictable. 


• Declarative tools - currently, the most common approach
in practice is probably to use a tool which allows
a declarative specification of the desired state.
The tool will then compute and implement the nec- 
essary changes (in an essentially indeterminate or- 
der). This guarantees that the resulting configura- 
tion matches the required specification, regardless 
of the starting point. As noted in the introduc- 
tion, typical tools include Puppet [16], Cfengine [3], 
BCFG [4] and LCFG [1]. 


• Fixed workflow orchestration - in many cases,
it is now essential to be able to perform sequences
of configuration changes automatically,
and/or unattended, and use of fixed workflow tools
is becoming more common. As noted in the intro- 
duction, ControlTier [5] and IBM Tivoli Provision- 
ing Manager [12] are typical examples. 


3.2 Automated Planning 


Automated planning techniques generate a plan (work- 
flow) automatically by computing a sequence of actions 
which will transition a system from some initial state to 
some required goal state. Each action has preconditions 
which are constraints that have to be satisfied before ex- 
ecuting the action, and effects which are conditions that 
will be true after executing the action.¹

¹ Formally, a planning problem can be defined as $P = (\Sigma, s_i, S_g)$,
where $\Sigma = (S, A, \gamma)$ is a state transition system, $s_i \in S$ is the
initial state, $S_g \subseteq S$ is a set of goal states, $A$ is a set of
actions, and $\gamma$ is a state transition function. The problem is to find a
sequence of actions $(a_1, a_2, \ldots, a_k)$ corresponding to a sequence of
state transitions $(s_i, s_1, \ldots, s_k)$ such that
$s_1 = \gamma(s_i, a_1), s_2 = \gamma(s_1, a_2), \ldots, s_k = \gamma(s_{k-1}, a_k)$,
and $s_k \in S_g$.


Practical implementations of automated planners use
search algorithms to find an appropriate sequence of actions.
There are several approaches to improving the
efficiency of a simple brute-force search:


• State-space planning uses a subset of the state
space as the search space where nodes correspond
to the world states, arcs correspond to the state transitions
and a path in the search space corresponds to
the plan. The algorithms try to find a plan that satisfies
the goal from the current state using particular
heuristics to minimize the computing time. Metric-FF [10]
and SGPlan [11] are examples of planners
that use this approach.


• Plan-space planning uses the plan space as the
search space where nodes are partially specified
plans and arcs are the plan refinement operations
intended to further complete a partial plan. The algorithms
try to eliminate all the flaws in the initial
partial plan, each of which is either an unsatisfied sub-goal
or a threat. The final plan will bring the system from
the initial to the goal state. Planners that use this approach
include UCPOP [15] and VHPOP [20].


• Planning-Graph uses a graph structure where
nodes correspond to world state propositions and
actions, and arcs correspond to preconditions and
effects of actions. The algorithms expand the graph
from the initial state until reaching the last layer that
contains all goals, which must not be mutually exclusive.
The solution (plan) can be found by applying
a backward-search algorithm from the last proposition
layer until reaching the first. The Graphplan
planner [2] and LPG [8] are examples of planners that
use this approach.


• Hierarchical Task Network (HTN) planning uses
algorithms that decompose the given tasks using
pre-defined methods until a set of primitive tasks is
reached and no non-primitive tasks remain. The
tasks are organized as a collection called a task network
which consists of a set of tasks and a set of
constraints. O-Plan [19] is an example of a planner
that uses this approach.


3.3 Related Works


There has been some previous work on the use of automated
planning techniques for sequencing configuration
changes in computing infrastructures. For example:
Keller et al. [13] introduced CHAMPS, which translates
the requested operations into a set of imperative
tasks and organizes them as a workflow to satisfy the
constraints as well as maximize the degree of parallelism.
However, CHAMPS does not reason about the
current state of the target system or about the preconditions
and effects of each task, which could lead to an unsound
plan.

In [6], an object modelling language is used to specify 
the goal states and the operational capabilities of the configured
components. The model is mapped into the Planning
Domain Definition Language (PDDL) [7] and given as
the input to a POP (partial-order planning) planner which 
generates the workflow. Hagen [9] models the compo- 
nent life-cycle using CIM (Common Information Model) 
objects which are stored in a Configuration Management 
Database (CMDB). Based on the defined goal states, the 
state-space planner called LPS (Logical Planning Stat) 
then directly manipulates the objects in the CMDB to 
generate the workflow. 

Both approaches demonstrate the viability of auto- 
mated planning techniques for changes to the configu- 
ration of a computing infrastructure. They also provide 
very flexible solutions. However the modelling and the 
specifications are comparatively complex, and it is not 
clear how these might be exposed to end-users in a prac- 
tically useful way. Levanti [14] provides a promising ap- 
proach to simplifying the interface to the planner — the 
user is presented with a set of tags which hide many of
the configuration details (states and operations). This enables
the user to define and refine the goal state by iteratively
selecting one or more tags. The workflow is
generated by mapping the selected tags into SPPL (Stream
Processing Planning Language) [17] as the input for the
SPPL planner [18].

We are not aware of any other work which meets our 
specific aims of using a standard planner to create a sys- 
tem which interfaces easily with common system config- 
uration tools. 


4 Prototype 


We have developed a prototype implementation which 
combines an automated planner, together with Con- 
trolTier and Puppet to generate and execute plans for 
configuration changes. This prototype is definitely not 
(yet) a production-quality tool. However, our main aim 
has been to prove that the concepts would be applicable 
to a real environment, so the tool uses production-quality
components, and is capable of generating practical work- 
flows from specifications of realistic problems. 

As illustrated in figure 2, the prototype consists of four
main components, i.e. the actions database, translator, planner,
and mapper. The components of the architecture are
described in more detail below (each number is
the component’s number in figure 2):

Figure 2: System Architecture of The Prototype


• The actions database (2) holds a library of actions,
together with their required preconditions and effects.
The actions can be written by anyone such as
third-party software vendors, in-house software engineers,
system administrators, or other specialists.


• A tool called facter (11) is used to acquire the current
state (3) of the system. The outputs are aggregated
by a translator (12) and then mapped into
PDDL.


• The administrator specifies a declarative goal state
(1) which is then mapped into PDDL.


• The planner (4) generates a plan (5) that will implement
the new specification on the target system.
This is based on the available predefined actions
(from the actions database), the current state (from
facter) and the goal state (supplied by the administrator).


• The mapper (6) uses the plan (5) to generate a ControlTier
workflow-command (7) which consists of
a set of other workflow-commands or primitive-commands.
For each primitive-command, the mapper
generates a Puppet manifest file (8) to implement
the action associated with the plan.


• ControlTier (9) manages the execution of the workflow
by sending the Puppet manifest file to the appropriate
target node and requesting Puppet (10) to
implement it. 


In the latest prototype, we use LPG [8] as the plan- 
ner. The main reason is that LPG can generate a plan 
(workflow) where all actions in each stage are mutually 
exclusive. This is an advantage because actions of each 
stage can be invoked in parallel by declarative tools to 
reduce the execution time. However, since we use the 


PDDL version 2.1 as the standard input for the planner, 
the prototype can utilize any available automated planner 
that supports it. 


Currently, besides the translator, all parts have been 
implemented in Ruby and C. The implementation of the 
translator is straightforward, i.e. translating the facts that
are aggregated by the facter to a set of propositions in 
PDDL. For the actions database, all available actions are 
currently stored as individual files. If the number of ac- 
tions were to become large, we could use a more struc- 
tured database to improve storing and querying perfor- 
mance. For configuring other equipment (e.g. a router), 
we could employ a proxy server as a bridge to communi- 
cate with the target equipment in order to implement the 
new specification and acquire its current state. 
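
The translation step can be sketched as follows. This is an illustration only, not the prototype's translator: the predicate naming scheme, the function name facts_to_pddl, and the problem and domain names are all assumptions, and the sketch handles only two-part keys such as A.run or C.server with boolean or string values. False facts are left out of the initial state, following the usual closed-world reading of a PDDL :init section.

```python
def facts_to_pddl(problem, domain, facts, goal):
    """Render flat facts and a goal as the text of a PDDL problem.

    "A.run": True      -> (run A)
    "A.run": False     -> omitted from :init, (not (run A)) in :goal
    "C.server": "A"    -> (server C A)
    """
    def owner(key):
        return key.split(".", 1)[0]

    def literal(key, value):
        obj, attr = key.split(".", 1)
        if isinstance(value, bool):
            atom = f"({attr} {obj})"
            return atom if value else f"(not {atom})"
        return f"({attr} {obj} {value})"

    merged = list(facts.items()) + list(goal.items())
    # Objects are every attribute owner plus every non-boolean value.
    objects = sorted({owner(k) for k, _ in merged} |
                     {v for _, v in merged if not isinstance(v, bool)})
    init = [literal(k, v) for k, v in sorted(facts.items()) if v is not False]
    goals = [literal(k, v) for k, v in sorted(goal.items())]
    return "\n".join([
        f"(define (problem {problem})",
        f"  (:domain {domain})",
        f"  (:objects {' '.join(objects)})",
        f"  (:init {' '.join(init)})",
        f"  (:goal (and {' '.join(goals)})))",
    ])


if __name__ == "__main__":
    facts = {"A.run": True, "B.run": False, "C.server": "A"}
    goal = {"C.server": "B", "A.run": False}
    print(facts_to_pddl("server-transition", "sysconfig", facts, goal))
```

A real translator would also need a matching PDDL domain file that declares these predicates and the actions from the actions database.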


In the process of generating the plan, the planner may 
use generic and domain-specific actions. A generic action,
which we call a “configuration pattern”, is a
reusable action which is applicable to any configuration
problem, whilst a domain-specific action is an action
which is applicable only to a particular configuration problem.
We will show examples of these types of action in the
experiments section.


An error could occur during the implementation of any 
part of the plan. For example, the “change-reference”
action cannot be executed if the target server is broken. 
To address this problem, the prototype could be set to 
identify the error from the execution log and perform the 
re-planning process to compute an alternative plan in or- 
der to attain the same goal state. If the alternative plan 
exists, it will then be implemented on the target system. 
Otherwise, the prototype could ask the administrator to 
modify the goal state. 
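
The re-planning behaviour described above could look roughly like the loop below. It is a sketch under stated assumptions rather than the prototype's code: read_state, make_plan and execute are hypothetical callables supplied by the caller, and the limit on the number of attempts is an added safeguard that the paper does not mention.

```python
def run_with_replanning(goal, read_state, make_plan, execute, max_attempts=3):
    """Execute a plan, re-planning from the observed state after a failure.

    read_state()            -> current state
    make_plan(state, goal)  -> list of steps, or None if no plan exists
    execute(step)           -> True on success, False on failure
    Returns True once a whole plan completes, False otherwise.
    """
    for _ in range(max_attempts):
        steps = make_plan(read_state(), goal)
        if steps is None:
            # No alternative plan: the administrator has to change the goal.
            return False
        # all() stops at the first failing step, so later steps are not run.
        if all(execute(step) for step in steps):
            return True
    return False
```

After a failed step the loop reads the state again, so the next plan starts from whatever the partially executed plan left behind.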

The prototype could also have a self-healing capabil- 
ity simply by evaluating the current and the goal state 
periodically. It will then generate and execute a plan for 
correcting any drift in the configuration of the system. 
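
Detecting drift amounts to comparing the observed state with the goal state. A minimal sketch, assuming both are flat dictionaries of attribute values (the function name drift is hypothetical):

```python
def drift(current, goal):
    """Return {attribute: (observed, wanted)} for every goal attribute that differs."""
    return {
        key: (current.get(key), wanted)
        for key, wanted in goal.items()
        if current.get(key) != wanted
    }
```

A periodic job could call this function and invoke the planner only when the result is non-empty.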


5 Experiments


5.1 Web Services 


In the first experiment, we reconfigure a system consist- 
ing of two web services WS-A and WS-B, a client PC, 
and a firewall FW. Currently, PC is using a web service 
provided by WS-A through port 8080 of FW and WS-B 
is stopped. As shown in figure 3a, the system’s current 
state is:


1. WS-A.run = true
2. WS-A.enable_firewall = true
3. WS-A.FW.port = 8080
4. WS-B.run = false
5. WS-B.enable_firewall = true
6. PC.service = WS-A
7. FW.ports(8080).open = true
8. FW.ports(9090).open = false


The administrator aims to shut down WS-A for maintenance
and redirect PC’s reference to WS-B. This will
change the configuration to the goal state shown in figure 
3b which can be specified declaratively as follows: 


1. WS-A.run = false
2. PC.service = WS-B
3. WS-B.FW.port = 9090
4. FW.ports(8080).open = false


In addition, the administrator must satisfy the follow- 
ing constraints in the implementation of the changes: 


1. The PC depends on the web service, thus it must
always reference a running web service.


2. Any unused port of FW must be closed to minimize
the vulnerability of the system.


To enable the planner to generate the right workflow, 
the first constraint is put in the preconditions of action 
stop-service. And the second one is declaratively speci- 
fied in the goal state (#4). The following applicable ac- 
tions are available in the actions database: 


1. start-service 
parameters: <service> <vm> 
preconditions: 
<service>.run = false 
<vm>.has = <service> 
<vm>.run = true 
effects: <service>.run = true 










Figure 3: The states of the web services system. (a) Current state: WS-A (run = true), WS-B (run = false). (b) Goal state: WS-A (run = false), WS-B (run = true).


2. stop-service 
parameters: <service> 
preconditions: 
<service>.run = true 
(forall (<client>) 
<client>.service != <service>) 
effects: <service>.run = false 


3. open-fport
parameters: <firewall> <port> 
preconditions: 
<firewall>.<port>.open = false 
effects: 
<firewall>.<port>.open = true 


4. close-fport
parameters: <firewall> <port>
preconditions:
<firewall>.<port>.open = true
(forall (<service>)
<service>.<firewall>.port != <port>)
effects:
<firewall>.<port>.open = false


5. assign-fport
parameters: <service1> <firewall> <port>
preconditions:
<firewall>.<port>.open = true
<service1>.enable_firewall = true
<service1>.<firewall>.port != <port>
(forall (<service2>)
<service2>.<firewall>.port != <port>)
effects:
<service1>.<firewall>.port = <port>


6. unassign-fport
parameters: <service> <firewall> <port>
preconditions:
<service>.<firewall>.port = <port>
(forall (<client>)
<client>.service != <service>)
effects:
<service>.<firewall>.port != <port>


7. change-ref-fport
parameters: <service1> <service2>
<client> <firewall> <port>
preconditions:
<client>.service = <service1>
<client>.service != <service2>
<service2>.run = true
<service2>.<firewall>.port = <port>
effects:
<client>.service != <service1>
<client>.service = <service2>


Using the current and goal states together with
the applicable actions in the actions database, the prototype
generated the following ControlTier workflow:


<command name="config_changes" 
command-type="WorkflowCommand" description="" 
is-static="true" error-handler-type="FAIL"> 
<workflow threadcount="1"> 
<command name="sub-workflow-1"/> 
<command name="assign-fport_WS-B_FW_P9090"/> 
<command name= 
"change-ref-fport_WS-A_WS-B_PC_FW_P9090"/> 
<command name="sub-workflow-2"/> 
<command name="close-fport_FW_P8080"/> 
</workflow>
</command> 


<command name="sub-workflow-1" 
command-type="WorkflowCommand" description="" 
is-static="true" error-handler-type="FAIL"> 
<workflow threadcount="2"> 
<command name="start-service_WS-B_VM-B"/> 
<command name="open-fport_FW_P9090"/>
</workflow>
</command> 


<command name="sub-workflow-2" 
command-type="WorkflowCommand" description="" 
is-static="true" error-handler-type="FAIL"> 
<workflow threadcount="2"> 
<command name="stop-service_WS-A"/> 
<command name="unassign-fport_WS-A_FW_P8080"/> 
</workflow>
</command> 


The prototype also generated the primitive ControlTier 
commands as shown in appendix B. 
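
The mapping from a staged plan to nested workflow-commands can be sketched with the Python standard library. This is not the prototype's mapper: the function names workflow_command and map_plan are hypothetical, the plan is assumed to arrive as a list of stages (each a list of action names), and the primitive commands that call exec.rb are left out.

```python
import xml.etree.ElementTree as ET


def workflow_command(name, threadcount, members):
    """Build one <command> element of type WorkflowCommand."""
    cmd = ET.Element("command", {
        "name": name,
        "command-type": "WorkflowCommand",
        "description": "",
        "is-static": "true",
        "error-handler-type": "FAIL",
    })
    wf = ET.SubElement(cmd, "workflow", {"threadcount": str(threadcount)})
    for member in members:
        ET.SubElement(wf, "command", {"name": member})
    return cmd


def map_plan(stages):
    """Map a staged plan to a main workflow followed by its sub-workflows.

    A stage with one action is referenced directly from the main workflow; a
    stage with several actions becomes a sub-workflow with one thread per action.
    """
    subs, main_members = [], []
    for stage in stages:
        if len(stage) == 1:
            main_members.append(stage[0])
        else:
            sub_name = f"sub-workflow-{len(subs) + 1}"
            subs.append(workflow_command(sub_name, len(stage), stage))
            main_members.append(sub_name)
    return [workflow_command("config_changes", 1, main_members)] + subs


if __name__ == "__main__":
    stages = [
        ["start-service_WS-B_VM-B", "open-fport_FW_P9090"],
        ["assign-fport_WS-B_FW_P9090"],
        ["change-ref-fport_WS-A_WS-B_PC_FW_P9090"],
        ["stop-service_WS-A", "unassign-fport_WS-A_FW_P8080"],
        ["close-fport_FW_P8080"],
    ]
    for element in map_plan(stages):
        print(ET.tostring(element, encoding="unicode"))
```

Applied to the five stages of this experiment, the sketch reproduces the same nesting as the workflow above: one main workflow with a single thread and two sub-workflows with two threads each.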

The generated workflow is a partial-order workflow
which consists of one main workflow-command (config_changes)
and two sub-workflow-commands (sub-workflow-1
and sub-workflow-2). config_changes is set
to be executed by one thread to enforce the ordering constraint,
i.e. a command must be invoked after the previous
one has finished successfully. On the other hand,
each sub-workflow-command is set to be executed by
two threads* to enable parallel execution. This is possible
since all commands of a sub-workflow-command
are mutually exclusive.

To implement the changes, the workflow was submit- 
ted by the mapper to ControlTier which coordinated the 
execution of each command. Puppet then used the appropriate
manifest file to achieve the desired state.


*The number of threads is the same as the number of primitive com- 
mands. 
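
The execution semantics described above (stages strictly in order, commands within a stage in parallel) can be imitated in a few lines. The sketch below only illustrates the idea and says nothing about how ControlTier is implemented: run_stages and run_command are hypothetical names, stages are assumed to be non-empty, and stopping at the first failed stage is an assumed reading of error-handler-type="FAIL".

```python
from concurrent.futures import ThreadPoolExecutor


def run_stages(stages, run_command):
    """Run stages in order; commands inside one stage run in parallel.

    run_command(name) must return True on success. Each stage gets one thread
    per command, and the next stage starts only after the whole stage is done.
    """
    for stage in stages:
        with ThreadPoolExecutor(max_workers=len(stage)) as pool:
            results = list(pool.map(run_command, stage))
        if not all(results):
            return False  # assumed FAIL semantics: abandon the remaining stages
    return True
```

Leaving the with block waits for every command of the stage, which is what enforces the ordering between stages.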


5.2 Cloud Burst 

Figure 4: The states of the company’s system before and
after the cloud-burst scenario.

In the second experiment, we simulated the cloud-burst
scenario on the computing infrastructure. In this
scenario, an organization must dynamically deploy its 
software application from its limited internal computing 
resources to the public cloud in order to address a spike 
in demand. 

We assumed a company has a private cloud infrastructure
which runs various services to serve its 24-hour
operations. One of them, WS-A, which is running on
virtual machine VM-A, is the most important web service
since it processes all financial transactions from the company’s
branch offices. Thus, the administrator has prepared
a backup web service, WS-B, which is installed on
virtual machine VM-B, in case there is a failure on WS-A.

Unfortunately, due to the limited resources of the physical
machines, the company’s private cloud infrastructure
is not capable of serving the spikes in demand which usually
happen on the last three days of each month. Therefore,
before the spike period, the administrator plans to
migrate WS-A temporarily to the public cloud to minimize
its response time.

The migration of WS-A from the private to the public 
cloud is not an easy task since the administrator must 
satisfy the following constraints: 


1. During the migration process, the service must always
be available, 24 hours a day, without any
down-time.


2. The company’s firewall must be reconfigured to allow
the LAN PCs to connect to the server
on the public cloud.


3. The web service application cannot be installed on
any other machines due to the limitations of the license.


Based on the above scenario, the current state of the 
system is illustrated in figure 4a which can be specified 
declaratively as: 


1. VM-A.cloud = PRIV-CLOUD
2. VM-A.run = true
3. WS-A.on = VM-A
4. WS-A.run = true
5. PC.service = WS-A
6. VM-B.cloud = PRIV-CLOUD
7. VM-B.run = false
8. WS-B.on = VM-B
9. WS-B.run = false


Where PRIV-CLOUD and PUB-CLOUD are the private 
and public cloud infrastructure respectively. 

To enable the cloud-burst, the system needs to achieve 
the goal state as illustrated in figure 4b. Therefore, the 
administrator can reconfigure the system using our pro- 
totype by declaring the goal state as: 


1. VM-A.cloud = PUB-CLOUD
2. WS-A.FW.port = 8080
3. PC.service = WS-A
4. VM-B.cloud = PRIV-CLOUD
5. VM-B.run = false


Where FW is the name of the company’s firewall. 

Fortunately, to generate the workflow, we only need to 
add five actions to the actions database since the planner 
can reuse the actions from the previous examples. The 
five new actions are: 


1. start-vm 
parameters: <vm> <cloud> 
preconditions: 
<cloud>.has = <vm> 
<vm>.run = false 
effects: 
<vm>.run = true 


2. stop-vm
parameters: <vm>
preconditions:
<vm>.run = true
(forall <service>
if <vm>.has = <service>
then <service>.run = false)
effects:
<vm>.run = false


3. change-ref
parameters: <service-1> <service-2> <client>
preconditions:
<client>.service = <service-1>
<client>.service != <service-2>
<service-2>.run = true
<service-2>.enable_firewall = false
effects:
<client>.service != <service-1>
<client>.service = <service-2>


4. migrate
parameters: <vm> <cloud-1> <cloud-2>
preconditions:
<vm>.run = false
<cloud-1>.has = <vm>
<cloud-2>.is_public = true
effects:
!(<cloud-1>.has = <vm>)
<cloud-2>.has = <vm>


5. set-need-firewall 
parameters: <service> 
preconditions: 
(forall (<firewall> <port>) 
<service>.<firewall>.port != <port>) 
(forall (<vm> <cloud>) 
if (<vm>.has = <service> 
and <cloud>.has = <vm>) 
then <cloud>.is_public = true) 
effects: 
<service>.enable_firewall = true


Some of these reusable actions, such as stop-service 
and start-service, typically occur in many different situ- 
ations and form a set of generic patterns. 
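
The stop-vm precondition combines a universal quantifier with an implication: for every service, if the VM hosts it, it must be stopped. A minimal sketch of how such a condition could be evaluated over a flat state follows; the function name can_stop_vm is hypothetical, and the hosting relation is read from the <service>.on attribute used in the state listing above, which is an assumption about how the two notations correspond.

```python
def can_stop_vm(vm, state, services):
    """stop-vm precondition: the VM is running and every service it hosts is stopped."""
    hosted_all_stopped = all(
        state.get(f"{svc}.run") is False      # then <service>.run = false
        for svc in services
        if state.get(f"{svc}.on") == vm       # if <vm>.has = <service>
    )
    return state.get(f"{vm}.run") is True and hosted_all_stopped
```

Services hosted elsewhere are filtered out before the check, so the implication is trivially true for them, and a VM that hosts no services can be stopped whenever it is running.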

After processing this information, the planner passed
its output to the mapper, which generated the following
ControlTier workflows:


<command name="config_changes" 
command-type="WorkflowCommand" description="" 
is-static="true" error-handler-type="FAIL"> 
<workflow threadcount="1"> 
<command name="sub-workflow-1"/> 
<command name="start-service_WS-B_VM-B"/> 
<command name="change-ref_WS-A_WS-B_PC"/> 
<command name="stop-service_WS-A"/> 
<command name="stop-vm_VM-A"/> 
<command name= 
"migrate_VM-A_PRIV-CLOUD_PUB-CLOUD"/> 
<command name="sub-workflow-2"/> 
<command name="sub-workflow-3"/> 
<command name=
"change-ref-fport_WS-B_WS-A_PC_FW_P8080"/>
<command name="stop-service_WS-B"/> 
<command name="stop-vm_VM-B"/>
</workflow> 
</command> 


<command name="sub-workflow-1" 
command-type="WorkflowCommand" description="" 
is-static="true" error-handler-type="FAIL"> 
<workflow threadcount="2"> 
<command name="open-fport_FW_P8080"/>
<command name="start-vm_VM-B_PRIV-CLOUD"/> 
</workflow> 
</command> 


<command name="sub-workflow-2" 
command-type="WorkflowCommand" description="" 
is-static="true" error-handler-type="FAIL"> 
<workflow threadcount="2"> 
<command name="set-need-firewall_WS-A"/> 
<command name="start-vm_VM-A_PUB-CLOUD"/> 
</workflow> 
</command> 


<command name="sub-workflow-3" 
command-type="WorkflowCommand" description="" 
is-static="true" error-handler-type="FAIL"> 
<workflow threadcount="2"> 
<command name="assign-fport_WS-A_FW_P8080"/> 
<command name="start-service_WS-A_VM-A"/> 
</workflow> 
</command> 


The prototype also generated the primitive ControlTier 
commands as shown in appendix C. 

The planner generated a partial-order plan (workflow)
which has three sub-workflows. Each sub-workflow
has a set of commands that can be run in parallel
because they are mutually exclusive. Submitting
the workflows to ControlTier implemented the new configuration
specification, which enables WS-A to serve more
clients than before.

If the administrator would like to stop using the public
cloud, WS-A can be migrated back to the private cloud
simply by changing the goal state of the system. The prototype
will automatically generate and execute the workflow
to implement the new specification. In this case, we
can also have a fully autonomic configuration tool by replacing
the administrator with an autonomic agent which
will automatically trigger the migration of the system
from the private to the public cloud or vice versa based on
the demands. 
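
Such an agent only has to choose a goal state; the planner does the rest. The sketch below is purely illustrative: the function name choose_goal, the load figure in the range 0 to 1, and both thresholds are invented for the example, and the two-threshold (hysteresis) rule is one possible policy rather than anything the prototype implements.

```python
def choose_goal(load, current_cloud, burst_at=0.8, return_at=0.5):
    """Pick the goal state for VM-A from a load figure between 0 and 1.

    Two thresholds are used so that the system does not flap between clouds
    when the load hovers around a single cut-off. Both values are made up.
    """
    if current_cloud == "PRIV-CLOUD" and load >= burst_at:
        target = "PUB-CLOUD"
    elif current_cloud == "PUB-CLOUD" and load <= return_at:
        target = "PRIV-CLOUD"
    else:
        target = current_cloud
    return {"VM-A.cloud": target, "PC.service": "WS-A"}
```

The returned dictionary plays the role of the declarative goal state that the administrator would otherwise supply.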


6 Conclusions 


This work has demonstrated the advantages of automated
planning for system reconfiguration: workflows can be
generated automatically (provided that a solution
exists) between any two declarative states, enabling
unattended, autonomic reconfiguration for failure
recovery or other reasons. The generated workflows are
guaranteed (by design) to achieve the desired target
state while preserving any necessary properties of the
system during the reconfiguration. We have also shown
that it is possible to build a practical tool that
generates workflows automatically and uses existing
production-quality tools for the deployment (as well as
the planning).

However, we suspect that the usability of such sys- 
tems will be a major challenge — firstly, languages and 
interfaces are required to enable working administrators 
to easily translate their requirements and specifications 
into a form that is usable by the planners. Secondly, ad- 
ministrators need to have confidence that the system will 
behave in a predictable way — planners are very good at 
exploiting a lack of precision in the specification to find 
very “creative” and unexpected solutions! The human 
interaction aspects of this problem are something which 
would benefit from future work. 

Error recovery is also a very important area.
Reconfigurations often occur in precisely those
situations where the system itself is unreliable, for
example during network and component failures or
system overload. Plans are likely to fail at some
intermediate stage, or a centralised planner may become
disconnected and lose track of the current state of an
executing plan.


7 Future Work 


We are currently investigating more distributed and
localised approaches to automated planning for
configuration changes. This will allow more autonomy
for individual components (thus improving resilience)
and break the planning problem into a hierarchy of
problems that are easier to understand and predict.

We believe that our implementation is much closer than
previous work to providing a practical solution for
system administrators who are familiar with current
configuration tools such as Puppet. However, most
administrators would still be unhappy to allow a
completely automated system to make significant
changes to their infrastructure: the chances of
unexpected and inappropriate solutions are still too
high. We believe that there is considerable scope here
for further work on appropriate languages and
interfaces, perhaps involving mixed-initiative
solutions that combine automated planning with human
guidance and automated explanations of proposed
solutions.
A The Flow-Chart of The Workflows


Figures 5a and 5b illustrate the flow-charts of the
generated workflows of the web services and cloud burst
examples. Each action is associated with a ControlTier
command as follows:

(b) The Workflow of Cloud Burst


Figure 5: The flow-chart of the workflows. 


B Primitive ControlTier Commands of
Web Services


<command name="start-service_WS-B_VM-B"
description="" command-type="Command"
is-static="true">
<execution-string>exec.rb</execution-string>
<argument-string>start-service.pp WS-B VM-B
</argument-string>
</command>


<command name="open-fport_FW_P9090"
description="" command-type="Command"
is-static="true">
<execution-string>exec.rb</execution-string>
<argument-string>open-fport.pp FW 9090
</argument-string>
</command>


<command name="assign-fport_WS-B_FW_P9090"
description="" command-type="Command"
is-static="true">
<execution-string>exec.rb</execution-string>
<argument-string>assign-fport.pp WS-B FW 9090
</argument-string>
</command>


<command
name="change-ref-fport_WS-A_WS-B_PC_FW_P9090"
description="" command-type="Command"
is-static="true">
<execution-string>exec.rb</execution-string>
<argument-string>change-ref-fport.pp WS-A WS-B
PC FW 9090</argument-string>
</command>


<command name="stop-service_WS-A"
description="" command-type="Command"
is-static="true">
<execution-string>exec.rb</execution-string>
<argument-string>stop-service.pp WS-A
</argument-string>
</command>


<command name="unassign-fport_WS-A_FW_P8080"
description="" command-type="Command"
is-static="true">
<execution-string>exec.rb</execution-string>
<argument-string>unassign-fport.pp WS-A FW 8080
</argument-string>
</command>


<command name="close-fport_FW_P8080"
description="" command-type="Command"
is-static="true">
<execution-string>exec.rb</execution-string>
<argument-string>close-fport.pp FW 8080
</argument-string>
</command>


C Primitive ControlTier Commands of 
Cloud-Burst 


<command name="open-fport_FW_P8080"
description="" command-type="Command"
is-static="true">
<execution-string>exec.rb</execution-string>
<argument-string>open-fport.pp FW 8080
</argument-string>
</command>


<command name="start-vm_VM-B_PRIV-CLOUD"
description="" command-type="Command"
is-static="true">
<execution-string>exec.rb</execution-string>
<argument-string>start-vm.pp VM-B PRIV-CLOUD
</argument-string>
</command>


<command name="start-service_WS-B_VM-B"
description="" command-type="Command"
is-static="true">
<execution-string>exec.rb</execution-string>
<argument-string>start-service.pp WS-B VM-B
</argument-string>
</command>


<command name="change-ref_WS-A_WS-B_PC"
description="" command-type="Command"
is-static="true">
<execution-string>exec.rb</execution-string>
<argument-string>change-ref.pp WS-A WS-B PC
</argument-string>
</command>


<command name="stop-service_WS-A"
description="" command-type="Command"
is-static="true">
<execution-string>exec.rb</execution-string>
<argument-string>stop-service.pp WS-A
</argument-string>
</command>


<command name="stop-vm_VM-A"
description="" command-type="Command"
is-static="true">
<execution-string>exec.rb</execution-string>
<argument-string>stop-vm.pp VM-A
</argument-string>
</command>

<command
name="migrate_VM-A_PRIV-CLOUD_PUB-CLOUD"
description="" command-type="Command"
is-static="true">
<execution-string>exec.rb</execution-string>
<argument-string>migrate.pp VM-A PRIV-CLOUD
PUB-CLOUD</argument-string>
</command>


<command name="set-need-firewall_WS-A"
description="" command-type="Command"
is-static="true">
<execution-string>exec.rb</execution-string>
<argument-string>set-need-firewall.pp
WS-A</argument-string>
</command>


<command name="start-vm_VM-A_PUB-CLOUD"
description="" command-type="Command"
is-static="true">
<execution-string>exec.rb</execution-string>
<argument-string>start-vm.pp VM-A PUB-CLOUD
</argument-string>
</command>


<command name="assign-fport_WS-A_FW_P8080"
description="" command-type="Command"
is-static="true">
<execution-string>exec.rb</execution-string>
<argument-string>assign-fport.pp WS-A FW 8080
</argument-string>
</command>


<command name="start-service_WS-A_VM-A"
description="" command-type="Command"
is-static="true">
<execution-string>exec.rb</execution-string>
<argument-string>start-service.pp WS-A VM-A
</argument-string>
</command>


<command
name="change-ref-fport_WS-B_WS-A_PC_FW_P8080"
description="" command-type="Command"
is-static="true">
<execution-string>exec.rb</execution-string>
<argument-string>change-ref-fport.pp WS-B WS-A
PC FW 8080</argument-string>
</command>


<command name="stop-service_WS-B"
description="" command-type="Command"
is-static="true">
<execution-string>exec.rb</execution-string>
<argument-string>stop-service.pp WS-B
</argument-string>
</command>


<command name="stop-vm_VM-B"
description="" command-type="Command"
is-static="true">
<execution-string>exec.rb</execution-string>
<argument-string>stop-vm.pp VM-B
</argument-string>
</command>
Abstract


System configuration tools automate the configuration
and management of IT infrastructures. However, these
tools fail to provide adequate authorisation of their
configuration input. In this paper we apply
fine-grained authorisation of individual changes to the
complex input language of an existing tool. We
developed a prototype that extracts meaningful changes
from the language used in the Puppet tool. These
changes are authorised using XACML. We applied this
approach successfully to realistic access control
scenarios and provide design patterns for developing
XACML policies.


1 Introduction 


The management of large IT infrastructures needs to be
automated to keep it manageable and to reduce the
number of human errors [3, 11]. A system configuration
tool is software that enables a system administrator to
automate the configuration and management of large IT
infrastructures. These tools address scalability,
heterogeneity, and the consistency of relations between
machines [2]. All system configuration tools have a
similar reference architecture: each managed device
runs an agent that manages the configuration of that
device. The agent compares the current state of the
device with the state described in a policy that is
stored in a database or repository on a central server.
This policy determines the configuration and the state
of the entire IT infrastructure. Therefore, anyone who
can add unauthorised changes to the central policy can
control the entire IT infrastructure, and access
control on this central policy is required.

Figure 1: Overview of the solution presented in this
paper.


System configuration tools can be divided into two
categories based on how the input policy is organised:
database-based or text-based [5]. The database-based
tools often use a graphical interface or command
interface to manipulate their policies. Access control
in these tools is enforced on records in that database.
The other input type uses textual configuration files.
The current state of practice for these text-based
system configuration tools is path-based access control
to prevent unauthorised access to the textual
configuration files, but the name and the path of a
file often have no relation to its contents. To be able
to use conventional path-based access control, current
tools rely on conventions that determine which
configuration statements may be included in which file.
For example, network-related configuration can only be
defined in the network.cf configuration file. System
management tools and path-based access control,
however, cannot prevent a malicious user from adding
network configuration statements to the file motd.cf.
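
The weakness of path-based access control can be illustrated with a small sketch. The access control list, the user names and the configuration statement below are hypothetical; the point is only that the check inspects the file path and never the content of the change.

```python
import fnmatch

# Hypothetical path-based ACL: user -> glob patterns of files they may edit.
PATH_ACL = {
    "webteam": ["motd.cf"],
    "netteam": ["network.cf"],
}


def path_allows(user, path):
    """Path-based check: only the file name is considered, never the content."""
    return any(fnmatch.fnmatch(path, pattern) for pattern in PATH_ACL.get(user, []))


# A network statement placed in motd.cf (illustrative statement syntax).
change_path = "motd.cf"
change_statement = "interface eth0 address 10.0.0.1"

# The change passes although it configures the network, because the
# convention "network settings live in network.cf" is not enforced.
allowed = path_allows("webteam", change_path)
```

An authorisation mechanism that looks at the meaning of the change would classify this statement as network configuration regardless of the file that contains it.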


In “Federated Access Control and Workflow Enforcement
in Systems Configuration” [10] we proposed a method
called ACHEL to enforce fine-grained access control
based on the semantics of a change. In other words,
ACHEL derives the operations that need to be
authorised from the changes made to a textual file. We
applied this approach to a minimal configuration
language to prove its viability in a prototype. This
prototype used a custom access control language based
on regular expressions. Our method is language
agnostic, except for the part that gives meaning to
each change. Figure 1 shows the steps in the ACHEL
method.


In this paper we apply the ACHEL method to the
configuration language of an existing system
configuration tool. The two research objectives of this
paper are:


1. Can we extract the AST from the compiler of a sys- 
tem configuration tool and can we reuse this internal 
AST or do we need to transform it in order to obtain 
differences that are semantically meaningful? 


2. How do we authorise changes? Once changes are 
known they have to be authorised. We propose 
an access control language and design patterns in 
this paper that provide the flexibility to express the 
different rules in a manageable and understandable 
fashion. 


In this paper we add fine-grained access control based
on meaningful changes to Puppet [1] and use XACML [8]
to authorise changes. We implemented a prototype and
integrated it with a version control system. Afterwards
we evaluated this prototype by comparing it with
traditional access control in two change scenarios. An
important subset of the Puppet language is supported,
and we present design patterns that make the full
expressiveness of the Puppet language, including the
unsupported language constructs, usable with our access
control mechanism.


In the remainder of this paper we first give some back- 
ground on the ACHEL authorisation mechanism in Sec- 
tion 2. Then we look at related work in Section 3. Af- 
terwards we discuss the methods used in the differenc- 
ing process in Section 4. In this section we also discuss 
the problems we encountered by using the AST Puppet 
generates. The next step is the actual authorisation. Sec- 
tion 5 introduces the XACML framework and proposes 
a design pattern for writing policies. Finally in Section 6 
we compare our method with other authorisation meth- 
ods. 
2 The ACHEL authorisation mechanism


The configuration model used by a system management 
tool is compiled from an input in the form of textual 
source code. This source input is stored on a filesys- 
tem or in a repository that uses version tracking. Access 
control and authorisation in the state of the art is based 
on operations performed on files and directories. In state 
of the art system management tools there is often no link 
between the file path and the parts of the configuration 
model represented in the file. Version control systems 
use diff-like algorithms [9] that operate on flat files to 
generate changes between two versions of a file. Diff al- 
gorithms detect changed lines and produce a list of insert 
and remove line operations. Applying access control on 
these operations does not make much sense. The
operations are highly syntax dependent and there is
only a weak link between the insert and remove
operations and the configuration model.

In large infrastructures updates are never applied di- 
rectly to the production infrastructure. Depending on the 
contents of the update or the person that produced the 
update, different authorisations can be required. For ex- 
ample: 


1. all changes from junior administrators need to be 
reviewed and approved by a senior administrator 


2. the scenario in Figure 2 where a change needs to be 
approved by a manager 


3. all changes to the production infrastructure out- 
side maintenance windows require approval by two 
managers 


4. in a federated infrastructure changes to the back- 
bone network need to be approved by the manage- 
ment of each of the administrative domains 


Existing system management tools and access control so- 
lutions provide no support for these complex workflows. 

Our method [10] derives the updates to the
configuration model by comparing the current and the
new version of the input source. It compiles the two
versions to an abstract syntax tree [7]. From the two
versions an edit script is generated that transforms
the old AST to the AST of the new version [4]. This
process is represented in Figure 1. Because we are
working on the AST, we know the semantics of changes
made to the nodes in the abstract syntax tree.
Therefore the edit script can be transformed to
operations on entities that exist in the configuration
model. Using our method, access control rules can be
expressed in terms of operations on the entities in the
configuration model: for example, instantiating a new
resource, instead of adding a line with a given syntax
to the input file. These operations are the actions
that need to be authorised. Additionally, in contrast
to other access control methods, this method derives
the operations to authorise automatically and requests
permission to apply them.


Figure 2: Updating the configuration model using access control.


For audit purposes the configuration model is often 
stored in version controlled repositories. These reposito- 
ries record each change to a configuration file and meta- 
data such as the user that made the change and an op- 
tional log message. In ACHEL changes to this repository 
are digitally signed with the private key of the adminis- 
trator. During generation of the edit script and the trans- 
formation of the edit script, the owner of each entity and 
the author of each change is tracked. The owner of an 
entity is the user that added or modified the entity. This 
ownership and author information is also exposed to the 
access control engine. 


We enforce update workflows by using distributed ver- 
sion control repositories. Each system administrator that 
makes changes to the configuration model has their own 
repository. Distributed version control repositories as- 
sign a unique identifier to each change based on the con- 
tents of the change. To enforce update workflows, a 
change is authorised by the owner of a key by signing 
this unique identifier and including it as an update in 
the repository. Access control rules can require the au- 
thorisation of a third party before an update is allowed. 
Because each distributed version control repository can 
have its own set of access control rules, very flexible up- 
date workflows can be enforced. 


Figure 2 represents a possible scenario supported by
ACHEL [10]. A system administrator makes a change that
is allowed in his repository, but pushing the change
into the repository for the production infrastructure
requires approval by a manager. The sysadmin asks the
manager to review his change. The manager reviews the
change and approves it by signing the identifier of the
change. The sysadmin can now push his change to the
production repository together with the signature of
the manager.
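
The approval step of this scenario can be sketched as follows. The sketch makes two simplifying assumptions that are not part of ACHEL: the change identifier is computed as a SHA-1 digest of the change content (similar in spirit to what distributed version control systems do), and an HMAC with a shared secret stands in for the digital signature made with the private key of the approver. All names and keys are hypothetical.

```python
import hashlib
import hmac


def change_id(content):
    """Content-derived identifier of a change (assumption: plain SHA-1)."""
    return hashlib.sha1(content).hexdigest()


def sign(key, identifier):
    """Stand-in for a digital signature: HMAC-SHA256 over the identifier."""
    return hmac.new(key, identifier.encode(), hashlib.sha256).hexdigest()


def push_allowed(content, approvals, required_keys):
    """Accept a push only if every required approver signed this exact change."""
    identifier = change_id(content)
    return all(
        any(hmac.compare_digest(sign(key, identifier), sig) for sig in approvals)
        for key in required_keys
    )


manager_key = b"manager-secret"  # hypothetical key material
patch = b"file { '/etc/motd': mode => '0644' }"

# The manager reviews the change and signs its identifier.
approval = sign(manager_key, change_id(patch))
```

Because the identifier depends on the content, an approval cannot be reused for a different change: push_allowed(patch, [approval], [manager_key]) succeeds, while the same approval attached to a modified patch is rejected.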


3 Related work 


In “A survey of system configuration tools” [5] we eval- 
uated several system configuration tools, including their 
support for access control and authorisation of changes. 
We identified two types of authorisation: either path 
based access control or access control based on “re- 
sources” in the configuration model. The tools that 
support external version repositories can reuse the path 
based access control of that repository or the access con- 
trol models that the filesystem provides. Other tools 
allow fine grained access control on “resources” in a 
database using a hierarchy of resources. The system con- 
figuration tools that enforce authorisation on “resources” 
do this on resources in the configuration model that is 
used to generate and deploy configuration files and
manage each system. The main disadvantage of this
method is that authorisation cannot be performed on
language constructions that are determined at runtime,
for example the use of a Collect instruction in Puppet.

“Authorisation and Delegation in the Machination
Configuration System” [6] proposes a method of
organising and delegating access to configuration
information. The author integrated this method in the
configuration management tool Machination. One of the
key requirements of his method is the ability to
authorise access to configuration aspects individually.
He meets this requirement by authorising the primitive
operations that manipulate the configuration. The
configuration representation used in Machination is a
form of XML with additional restrictions. These
restrictions ensure that every configuration element is
addressable by an XPath query. On this representation,
a set of primitive operations is defined. These
operations edit the configuration by adding, removing,
changing and ordering the individual elements in the
configuration input. Authorisation is then performed on
the individual operations needed to transform the
configuration. By grouping multiple elements together
such that they can be referred to by an XPath query,
multiple configuration aspects can be authorised.

Both tools use the principle of authorisation on the in- 
dividual elements. Where Machination starts from one 
version and uses the operation to obtain the new version, 
ACHEL derives the operations that need to be authorised 
from the two versions. This paper describes a method 
in which the differencing of ACHEL is used to find the 
changes made to a file. 


4 Extracting changes 


The ACHEL authorisation method starts from the
configuration file that has been changed. The method
consists of two phases. The first phase retrieves the
AST of each file from the Puppet compiler. For the
differencing algorithm to work, this AST should no
longer contain any grammatical constructs. This is not
the case for the AST Puppet produces, which still
contains some syntactical leftovers. Therefore a
transformation step is also included in this phase. The
second phase compares the two trees and calculates an
edit script that describes the operations that
transform the first AST into the second AST. This edit
script is transformed into meaningful changes expressed
in terms of the language constructs in the Puppet
language.

The Puppet language is an expressive language that also
contains control flow and runtime-evaluated
expressions, such as the case statement or virtual and
exported resources. Applying access control to changes
that include these language constructs is very hard
with our method because the effects of such statements
are only known when the “configuration policy” of each
managed device is calculated. In this prototype of the
ACHEL authorisation method for Puppet we support a
limited set of language constructs in the Puppet
language on which we can apply authorisation. This set
includes definitions, classes, resources (including
arrays as identifiers) and relations. In Section 5.2 we
will argue why this limited set can already be used to
create powerful access control policies.


4.1 Generating the abstract syntax tree 


The ACHEL authorisation mechanism requires access to 
the AST of each version of the Puppet manifests. In this 
section we describe how we extract the AST from Pup- 
pet. The AST from Puppet is not directly usable for our 
mechanism because it contains syntactical constructions. 
Therefore we need to normalise this AST. 

We use the Puppet parser to create the AST of a Puppet
input file. This provides us with the AST that Puppet
reasons upon. However, this AST is not suited for
generating an edit script. Although it is an abstract
syntax tree, the tree still contains syntactical
language constructs from the Puppet language. This is a
problem because these constructs carry no meaning of
their own, and they can even result in two different
ASTs that have the same semantics. Consider Figure 3a
and the corresponding AST in Figure 3b. Line 1 in
Figure 3a describes the declaration of two users. The
AST of the code fragment still contains the array,
which is nothing more than syntactic sugar to easily
create two resources with the same attributes. When a
user is added to this array, the only difference the
differencing algorithm sees is the addition of one
string to an array, instead of the addition of an
entire resource. This change is not meaningful and
cannot be described correctly in a policy.

The solution is to transform and normalise this tree to 
remove all syntactical structures from the abstract syntax 
tree. In the example from the previous paragraph we can 
remove the array as identifier for the users, and replicate 
the whole definition of the user for each element of this 
array. This transformation ensures that when a user gets 
added or removed, the differencing will detect a user be- 
ing added or removed. The transformed AST is depicted 
in Figure 3c. 
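
A minimal sketch of this normalisation step is shown below. It does not operate on the real Puppet AST classes; the dictionaries are an assumed, simplified representation of a resource node with a type, a title and parameters.

```python
import copy


def normalise(resources):
    """Expand every resource whose title is a list into one resource per title."""
    normalised = []
    for resource in resources:
        title = resource["title"]
        titles = title if isinstance(title, list) else [title]
        for single_title in titles:
            # Replicate the whole definition so each copy can be diffed alone.
            clone = copy.deepcopy(resource)
            clone["title"] = single_title
            normalised.append(clone)
    return normalised


# Simplified stand-in for the AST of the manifest in Figure 3a.
ast = [{"type": "user", "title": ["kwik", "kwak"], "params": {"gid": "123"}}]
normalised_ast = normalise(ast)
```

After this step, adding a third name to the array shows up as a third, complete user resource in the normalised tree, so the differencing stage reports the addition of a resource rather than the addition of a string.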

The solution in this example is very specific to the
given problem, and there is no generic solution to
remove the syntax leftovers in the AST or even to
detect them. Moreover, we do not have a list of
problematic structures that are present in the Puppet
language. The only way to fully support a language is
to test every language concept and check the resulting
AST structure. In this implementation of our
authorisation mechanism we explicitly chose to
transform the AST instead of running a preprocessor
over the input source. With this approach we reuse the
existing lexer and parser of the Puppet tool, which
makes the transformation step less syntax dependent. If
a concept results in an ambiguous AST or one that
contains syntactical constructs, the AST needs to be
transformed or the compiler needs to be adapted.
    1 user {["kwik","kwak"]:
    2     gid => 123
    3 }

(a) The Puppet manifest

    class: ASTClass
    - member: Resource
      + type: Name => user
      + title: ASTArray
      | + child: String => kwik
      | - child: String => kwak
      - parameter: ResourceParam
        + param: Name => gid
        - value: String => 123

(b) The AST created by Puppet

    class: ASTClass
    + member: Resource
    | + type: String => user
    | + title: String => kwik
    | - parameter: ResourceParam
    |   + param: Name => gid
    |   - value: String => 123
    - member: Resource
      + type: String => user
      + title: String => kwak
      - parameter: ResourceParam
        + param: Name => gid
        - value: Name => 123

(c) The normalised AST


Figure 3: Puppet configuration that defines multiple
users using one resource definition.

The differencing stage compares the two normalised
ASTs to generate an edit script. In this prototype we
use the same algorithm as in our previous work [10].
This algorithm works as follows:


1. Match the leaves of the two trees using a similarity 
function. 


2. Match the internal nodes using the information of 
already matched leaves: nodes with a lot of leaves 
in their subtrees in common are likely to match as 
well. 


3. Correct wrongly coupled leaves using information 
of the matched internal nodes: parents of matching 
leaves should match as well. 


4. Generate an edit script with the basic changes: add, 
modify and delete. 


5. Correct changes: e.g. remove changes that cancel 
each other. 
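
The sketch below illustrates only the output of this stage, not the tree-matching algorithm itself. It is a deliberately flattened simplification: resources are matched by an exact (type, title) key instead of a similarity function, and the matching and correction steps on internal nodes are omitted. The data layout is an assumption made for the example.

```python
def edit_script(old, new):
    """Return (op, key, param, old_value, new_value) tuples for two resource maps.

    Both arguments map a (type, title) key to a dict of parameters.
    """
    ops = []
    for key in sorted(old.keys() - new.keys()):
        ops.append(("delete", key, None, None, None))
    for key in sorted(new.keys() - old.keys()):
        ops.append(("add", key, None, None, None))
    for key in sorted(old.keys() & new.keys()):
        for param in sorted(old[key].keys() | new[key].keys()):
            before, after = old[key].get(param), new[key].get(param)
            if before != after:
                ops.append(("modify", key, param, before, after))
    return ops


old_model = {
    ("user", "kwik"): {"gid": "123"},
    ("file", "/etc/motd"): {"mode": "0600"},
}
new_model = {
    ("user", "kwik"): {"gid": "123"},
    ("user", "kwak"): {"gid": "123"},
    ("file", "/etc/motd"): {"mode": "0644"},
}
script = edit_script(old_model, new_model)
```

For these two versions the script contains one add operation for the user kwak and one modify operation for the mode of /etc/motd; the unchanged user kwik produces no operation.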


4.2 Generating meaningful changes 


Authorisation is enforced on operations derived from
the meaning of a change and not on the operations in an
edit script; therefore the edit script is transformed
into meaningful changes. For instance, “the mode
parameter of /etc/motd changed from 0600 to 0644”
instead of “the 0600 node in the AST was removed and
replaced by the 0644 node”. These meaningful changes
express changes as operations on the concepts that
exist in the Puppet configuration language, instead of
operations on a tree. This step is language dependent.
The edit script expresses operations on the nodes in
the AST. These nodes are linked to specific concepts in
the Puppet language. This step therefore requires a
mapping from operations on AST nodes to the possible
operations on language constructs. In our method this
mapping is coded manually.
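
Such a manually coded mapping can be sketched as a small function. The operation names and the tuple layout are assumptions made for this illustration and do not reflect the data structures of the prototype.

```python
def describe(op, key, param=None, old=None, new=None):
    """Map one edit-script operation to a change on a Puppet language concept."""
    resource_type, title = key
    if op == "add":
        return f"create {resource_type} resource '{title}'"
    if op == "delete":
        return f"remove {resource_type} resource '{title}'"
    if op == "modify":
        return f"change {param} of {resource_type} '{title}' from {old} to {new}"
    raise ValueError(f"unknown operation: {op}")


meaningful = describe("modify", ("file", "/etc/motd"), "mode", "0600", "0644")
```

The resulting description refers to a file resource and its mode parameter, which are concepts a policy author knows, rather than to removed and inserted tree nodes.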


5 Authorising changes


The second component in our solution is the
authorisation of individual configuration changes. For
this authorisation two elements are needed: a set of
policies describing which changes are allowed or
denied, and a framework that executes the actual
authorisation. XACML provides both and is a widely used
authorisation standard in industry. Therefore we used
it to implement the authorisation step. In this section
we discuss the use of XACML to describe the access
control policies.


5.1 The XACML standard 


XACML is an international standard for access control
and authorisation. The standard defines a language for 
policies and a language for authorisation requests. Both 
are XML based. The standard also describes the compo- 
nents and the architecture of an authorisation engine and 
allows an XACML authorisation engine to be extended. 

XACML defines the components and the dataflow be- 
tween them in the authorisation engine. The following 
components are required to handle an authorisation re- 
quest: 


• Policy Enforcement Point (PEP): This component
  receives the authorisation requests and creates an
  XACML request from each of them that is sent to the
  PDP.


• Policy Decision Point (PDP): The PDP loads all
  required policies and validates the request from the
  PEP against these policies. The results of these
  checks are combined and sent back to the PEP.


• Policy Administration Point (PAP): The PAP makes the
  policies available to the PDP.


• Policy Information Point (PIP): The PIP provides the
  PDP with attributes related to subject, resource or
  environment. These attributes can be retrieved from
  several sources such as files or databases.
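
The dataflow between these components can be sketched in a few lines. This is a toy model, not an XACML engine: requests are plain dictionaries instead of XML messages, and the role table, the single policy and the all-must-permit combination are assumptions made for the example.

```python
def pip_lookup(subject):
    """Hypothetical PIP: extra subject attributes from a static table."""
    roles = {"alice": "senior", "bob": "junior"}
    return {"role": roles.get(subject, "unknown")}


def pap_policies():
    """Hypothetical PAP: the policies made available to the PDP."""
    return [lambda request: "Permit" if request["role"] == "senior" else "Deny"]


def pdp_decide(request):
    """PDP: enrich the request via the PIP, evaluate the policies, combine."""
    enriched = {**request, **pip_lookup(request["subject"])}
    decisions = [policy(enriched) for policy in pap_policies()]
    permitted = bool(decisions) and all(d == "Permit" for d in decisions)
    return "Permit" if permitted else "Deny"


def pep_enforce(subject, action, resource):
    """PEP: build the request, ask the PDP and enforce its answer."""
    request = {"subject": subject, "action": action, "resource": resource}
    return pdp_decide(request) == "Permit"
```

In this toy setup pep_enforce("alice", "add", "user[kwak]") is allowed and the same request from bob is refused, because the PIP reports different roles for the two subjects.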


XACML is a generic solution for domain specific au- 
thorisation. The domain specific entities involved in the 
authorisation process can be mapped to subject, resource 
and action from the XACML standard. The subject sub- 
mits a request to perform an action on a resource. Each of 
these entities can have multiple attributes. A policy deci- 
sion is based on these attributes and additional attributes 
provided by the PIP. 

The policy contains the rules that define what is al- 
lowed. Policies can be grouped in policy sets and each 
policy set can consist of policies and other policy sets. 
A policy is built from targets, rules, a rule combination 
algorithm and a number of obligations. 


• The target of a policy defines when a policy needs
  to be used. This is expressed using a matching
  expression over the attributes of subject, action and
  resource.


• Rules have a target that defines when a rule is
  applicable, a condition and an effect that defines
  Permit or Deny based on the condition.


• The combining algorithm determines what the final
  result of a policy is if multiple rules returned an
  effect.


• The obligation is an action that needs to be executed
  when a policy is applicable. The PEP is responsible
  for executing these obligations.
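
How these four parts interact can be sketched as follows. The sketch models a policy as a dictionary of Python callables rather than as XACML XML, and it implements a simplified deny-overrides combining algorithm that ignores the Indeterminate result defined by the standard. The attribute names and the obligation are hypothetical.

```python
def deny_overrides(effects):
    """Deny wins; otherwise Permit if any rule permitted; else NotApplicable."""
    if "Deny" in effects:
        return "Deny"
    if "Permit" in effects:
        return "Permit"
    return "NotApplicable"


def evaluate(policy, request):
    """Return (decision, obligations) for one policy and one request."""
    if not policy["target"](request):
        return "NotApplicable", []
    effects = [
        rule["effect"]
        for rule in policy["rules"]
        if rule["target"](request) and rule["condition"](request)
    ]
    decision = deny_overrides(effects)
    obligations = policy["obligations"] if decision != "NotApplicable" else []
    return decision, obligations


# Hypothetical policy: seniors may add users, juniors may not change users.
user_policy = {
    "target": lambda r: r["resource_type"] == "user",
    "rules": [
        {"target": lambda r: r["action"] == "add",
         "condition": lambda r: r["role"] == "senior",
         "effect": "Permit"},
        {"target": lambda r: True,
         "condition": lambda r: r["role"] == "junior",
         "effect": "Deny"},
    ],
    "obligations": ["log-decision"],
}
```

The policy target filters on the resource type, each rule contributes an effect only when both its target and its condition hold, and the combining algorithm reduces the collected effects to one decision that is returned together with the obligations for the PEP.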


The authorisation process works by exchanging re- 
quest and response messages between the PDP and the 
outside world. The request message contains the subject, 
resource, action and environment and the associated at- 
tributes. If the content of the resource is XML it can be 
embedded in the request message. When the PDP has 
calculated the result of the authorisation, a response
message is sent back. This response message contains a
result code and an optional message or information.


5.2 XACML policies for Puppet 


Our authorisation method provides the configuration
changes to the XACML engine that enforces
authorisation. This section explains how a policy can
reference Puppet language constructions in a
configuration change. It also provides a design pattern
to encapsulate unsupported language constructions so
that authorisation can be enforced on them. The XACML
standard describes an XML-based policy language. This
language provides methods to access and compare
attributes of the resource, subject and action involved
in the authorisation request. Complex functions can be
used to process these attributes and to calculate the
outcome of the policies. An example policy is shown in
Figure 4.

XACML policies need a method for referring to the
operations an update consists of and to the Puppet language
constructs the operations act upon. XACML is based on XML,
and the configuration input is already available in the
form of an abstract syntax tree (AST). Additionally, a
resource can be embedded in a XACML authorisation request
if the resource is represented in XML. Therefore we
developed an XML serialisation of the Puppet manifests.
We based this serialisation on the approach of
Machination [6], which refers to constructs in the
input using XPath. The AST is transformed into an XML
tree that can be referenced uniquely by means of XPath
expressions. Figures 5a, 5b and 5c show a Puppet statement
and the two representations of the AST.

XPath queries can refer to the individual elements in
the XML serialisation of the AST. When the node-id of
a node is known, this attribute can be used to refer
directly to this node using an XPath query like
//*[@id="3"]. When referring to the node by its attributes
and location, the following XPath query can be
used: //class[@name="apache"]/*[@type="package"]
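
As an illustration, both kinds of query can be tried against a serialisation like the one in Figure 5c. The sketch below uses Python's standard xml.etree.ElementTree module, which supports only a subset of XPath; the XML namespace is omitted for brevity, and the prototype itself does not necessarily evaluate its queries this way.

```python
import xml.etree.ElementTree as ET

# XML serialisation of the AST, modelled on Figure 5c (namespace omitted).
AST_XML = """
<Root id="1" nodetype="ASTRoot">
  <hostclasses id="2" nodetype="ResourceType">
    <class id="3" nodetype="ASTClass" name="apache">
      <parent id="4" nodetype="Name">webserver</parent>
      <member id="5" nodetype="Resource" title="apache" type="package">
        <parameter id="7" nodetype="ResourceParam" param="ensure">
          <value id="8" nodetype="Name">installed</value>
        </parameter>
      </member>
    </class>
  </hostclasses>
  <nodes id="9" nodetype="ResourceType"/>
</Root>
"""

root = ET.fromstring(AST_XML)

# Refer to a node directly by its id.
by_id = root.findall(".//*[@id='3']")

# Refer to a node by its attributes and its location in the tree.
by_location = root.findall(".//class[@name='apache']/*[@type='package']")

for node in by_id + by_location:
    print(node.tag, node.attrib)
```

The first query selects the class element, the second the package resource declared inside the apache class.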

Puppet classes and definitions can be used to create 
abstractions on which access control can be enforced. 
These abstractions can encapsulate the language con- 
cepts our prototype currently does not support or that are 
very hard or impossible to support because of their dynamic
nature. We used this design pattern in our evaluation
to create access control policies. Superusers are
allowed to make all changes, including the statements that
are not supported. These superusers encapsulate these 
statements in definitions and classes that can be used by 
other users. This design pattern matches closely to the 
configuration module approach used by Puppet. These 
modules encapsulate the domain expert knowledge in 
easy to use interfaces and classes. 


5.3 Using external information sources


XACML can use external sources for information
through a Policy Information Point (PIP). In our prototype
we extended the XACML engine to retrieve external
information from directory services such as LDAP or
Active Directory. These directories contained the roles of
each user that can make changes. Storing this information
in an external source and making it available in the
XACML engine makes it possible for the XACML policies
to be more generic.
Puppet also supports external sources for retrieving the 
classes that it should assign to hosts. One of such exter- 
<Policy PolicyId="nodes:apache">
  <Target><Resources><Resource>
    <ResourceMatch MatchId="xacml:function:xpath-node-match">
      <AttributeValue DataType="xacml2:data-type:xpath-expression">
        //class[@name="apache"]
      </AttributeValue>
      <ResourceAttributeDesignator AttributeId="xacml:resource:resource-id"
        DataType="xacml2:data-type:xpath-expression" />
    </ResourceMatch>
  </Resource></Resources></Target>
  <Rule Effect="Permit" RuleId="nodes:apache:webadmin">
    <Target />
    <Condition>
      <Apply FunctionId="xacml:function:string-greater-than">
        <AttributeValue DataType="xs:string">xyz</AttributeValue>
        <Apply FunctionId="xacml:function:string-one-and-only">
          <SubjectAttributeDesignator DataType="xs:string"
            AttributeId="xacml:subject:subject-id"/>
        </Apply>
      </Apply>
    </Condition>
  </Rule>
</Policy>


Figure 4: A sample XACML policy file for configuration changes. 


# Apache-class
class apache inherits webserver {
  package {"apache": ensure => installed }
}

(a) Sample Puppet configuration


Root
+ hostclasses: ResourceType ()
| - class: ASTClass (name: apache)
|   + parent: Name () => webserver
|   - member: Resource (title: apache, type: package)
|     + parameter: ResourceParam (param: ensure)
|       - value: Name () => installed
- nodes: ResourceType ()

(b) The abstract syntax tree


<Root id='1' nodetype='ASTRoot' xmlns='pupa'>
  <hostclasses id='2' nodetype='ResourceType'>
    <class id='3' nodetype='ASTClass' name='apache'>
      <parent id='4' nodetype='Name'>webserver</parent>
      <member id='5' nodetype='Resource' title='apache' type='package'>
        <parameter id='7' nodetype='ResourceParam' param='ensure'>
          <value id='8' nodetype='Name'>installed</value>
        </parameter>
      </member>
    </class>
  </hostclasses>
  <nodes id='9' nodetype='ResourceType'>
  </nodes>
</Root>

(c) The XML representation of the AST

Figure 5: Puppet configuration file and the resulting AST and its XML representation
nal sources is an LDAP directory. This information can
also be exposed in the XACML engine through a PIP. 


6 Evaluation 


We evaluated our prototype based on two access control
scenarios. Each scenario describes a policy
that has to be enforced. In the evaluation we construct a
policy file that tries to accomplish this task and explain
the reasons behind its structure. We compare the results
of our policy with a policy based on the path-based access
control available in version control systems. The goal of
these evaluations is to show the possibilities and limitations
of our tool.

For this evaluation we integrated our prototype into a 
version control system (VCS). The VCS is used as stor- 
age for the configuration files and also acts as the au- 
thorising agent. Additionally a Policy Information Point 
(PIP) provides additional attributes to the XACML en- 
gine. The PIP manages information and is contacted dur- 
ing the authorisation process when specific attributes are 
needed. The PIP in our implementation connects to an 
LDAP server and provides attributes belonging to the ad- 
ministrator issuing the change. 

In the evaluation, two scenarios are investigated: 


1. Only system administrators that are members of the 
webadmin group can configure a machine as an 
apache webserver. 


2. Only system administrators that are members of the
webadmin group can create a virtual host on an
apache webserver. Moreover, the document root of
the virtual host can only exist inside the home directory
of the user issuing the change.


6.1 Scenario 1: configure a machine as webserver


In the first scenario we have a simple Puppet configura- 
tion that configures nodes as a webserver. In the Pup- 
pet module path an apache module is added with a class 
named apache. The site.pp file contains a list of nodes 
that each have a list of include statements that add func- 
tionality to that server. Figure 6 shows the initial site.pp 
file for this scenario. The change in this scenario config- 
ures the spare server san-jose as webserver by including 
the apache class from the apache module. The security 
policy says that all administrators can add roles to servers 
by including classes, but only users from the group we- 
badmin may configure a server as webserver. 

An SVN repository can be used to limit access to the files
in the repository. Figure 7 shows a file with access control
rules for this repository. The puppet user has read-only
access to the site.pp file in the site directory and
to the files in the apache module. Only webadmin
users can edit the files in the apache module, but
this does not prevent them from including, for example, a
statement that installs a DNS server in that module. The
site.pp file can be edited by all administrators. The access
control mechanism of SVN cannot prevent non-webadmin
users from creating new webservers either.

Figure 8 shows the XACML policy for the ACHEL au- 
thorisation mechanism. This policy allows users from the 
webadmin group to include apache classes in the config- 
uration of a node. This XACML policy provides a very 
fine-grained access control to the statements in the Pup- 
pet manifests. 
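
The difference with path-based access control can be sketched in a few lines of Python. The sketch decides on the content of a change rather than on the file it touches. The element names change, node and include, the role names and the function may_apply are hypothetical stand-ins for the XML serialisation and the XACML policy, not the prototype's actual interface.

```python
import xml.etree.ElementTree as ET


def may_apply(change_xml: str, roles: set) -> bool:
    """Content-aware check: including the apache class requires the webadmin role.

    Any other include statement is open to all administrators, mirroring
    the security policy of the first scenario.
    """
    change = ET.fromstring(change_xml)
    apache_includes = change.findall(".//node/include[@class='apache']")
    if not apache_includes:
        return True
    return "webadmin" in roles


# A change that configures the spare server san-jose as a webserver.
apache_change = (
    '<change><node name="san-jose"><include class="apache"/></node></change>'
)
print(may_apply(apache_change, {"admins"}))
print(may_apply(apache_change, {"admins", "webadmin"}))
```

The decision depends only on what the change does, so it stays the same regardless of the manifest file in which the include statement appears.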


node baltimore {
  # include the apache module
  include apache
}

node san-jose {
  # spare machine to configure as webserver
}


Figure 6: The site.pp file for Puppet for the first scenario. 
One server is already configured as webserver. The other 
server is kept as spare server. 


[/modules/apache]
puppet = r
@webadmin = rw

[/site]
puppet = r
@admins = rw


Figure 7: An authorisation file for SVN to restrict access 
to the Puppet manifests in the repository. 


6.2 Scenario 2: add virtual hosts 


The second scenario uses the same apache webserver 
setup and allows users from the webuser group to add 
virtual hosts to the apache configuration. Webusers can
only add virtual hosts whose document root is located
in their own home directory. The document root parameter
controls the directory that contains the files that a
webserver should serve to visitors of a particular domain.
The home directory path of a user is always built by
concatenating /home/ and the username. The Puppet manifest in
<Policy PolicyId="nodes:apache">
  <Target><Resources><Resource>
    <ResourceMatch MatchId="xacml:function:xpath-node-match">
      <AttributeValue DataType="xacml2:data-type:xpath-expression">
        //p:node/p:include[@class="apache"]
      </AttributeValue>
      <ResourceAttributeDesignator
        AttributeId="xacml:resource:resource-id"
        DataType="xacml2:data-type:xpath-expression" />
    </ResourceMatch>
  </Resource></Resources></Target>
  <Rule Effect="Permit" RuleId="nodes:apache:webadmin">
    <Target><Subjects><Subject>
      <SubjectMatch MatchId="xacml:function:anyURI-equal">
        <AttributeValue DataType="xs:anyURI">webadmin</AttributeValue>
        <SubjectAttributeDesignator AttributeId="xacml2:subject:role"
          DataType="xs:anyURI" />
      </SubjectMatch>
    </Subject></Subjects></Target>
  </Rule>
</Policy>


Figure 8: The XACML policy to allow users from the webadmin group to add the apache class to a node. 


Figure 9 shows a manifest from the apache module that 
configures the virtual hosts in the system. 


The access control configuration for an SVN repository
for this scenario is similar to the previous scenario. Figure
11 shows an updated authorisation file for the SVN
repository that contains the Puppet manifests for this sce- 
nario. This file limits access to the vhosts.pp file to users 
from the webuser group. It cannot prevent users from 
adding virtual hosts that have a document root in the 
home directory of another user. It also does not prevent 
users from adding virtual host resources to other mani- 
fest files. 


The XACML policy in Figure 10 enforces access con- 
trol based on the contents of the change and not based on 
the file location. It enforces access control on all occur- 
rences of virtual host resources in any Puppet manifest 
file that is included in the repository. The policy builds 
the home directory of the user by concatenating /home/ 
with the username. The value of the documentroot pa- 
rameters of the virtual host resource should always start 
with the home directory string the policy created. 


This policy can be circumvented by using a path that
starts with the user's home directory but then uses .. to
traverse back to the /home directory, for example
/home/lisa/../foo/www. This is not a flaw of the approach but
a limitation of the expressiveness of the XACML functions.
To close this policy, a function is required that normalises
the path before the comparison is made. In the
conclusion we will discuss how to counter this attack.
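
The effect of normalising the path before comparing can be sketched in Python. The function names are hypothetical, and the prototype's engine is Java-based, so this only illustrates the logic a domain-specific helper function would need. The sketch also compares against the home directory followed by a separator, which is an additional assumption meant to reject sibling directories such as /home/lisa2. Note that posixpath.normpath only rewrites the string and does not resolve symbolic links.

```python
import posixpath


def docroot_allowed_naive(docroot: str, username: str) -> bool:
    """Plain prefix check, comparable to a string-starts-with function."""
    return docroot.startswith("/home/" + username)


def docroot_allowed(docroot: str, username: str) -> bool:
    """Prefix check on the normalised path, so '..' segments cannot escape."""
    home = "/home/" + username
    normalised = posixpath.normpath(docroot)
    return normalised == home or normalised.startswith(home + "/")


attack = "/home/lisa/../foo/www"
print(docroot_allowed_naive(attack, "lisa"))        # the naive check accepts the path
print(docroot_allowed(attack, "lisa"))              # the normalising check rejects it
print(docroot_allowed("/home/lisa/www", "lisa"))    # a legitimate path is still accepted
```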


class apache {
  apache::vhost {"www.example.com":
    docroot => "/home/lisa/www"
  }

  apache::vhost {"photo.example.com":
    docroot => "/home/lisa/photo"
  }
}


Figure 9: The vhosts.pp file for Puppet for the second 
scenario that is located in the /vhosts directory. 


[/modules/apache]
puppet = r
@webadmin = rw

[/vhosts]
puppet = r
@webuser = rw

[/site]
puppet = r
@admins = rw


Figure 11: An updated authorisation file for SVN for the
second scenario.


6.3 Conclusion


The conclusions from this evaluation are twofold. First, 
there is a strong need for content-aware authorisation. 
<Policy PolicyId="apache:webuser">
  <Target><Subjects><Subject>
    <SubjectMatch MatchId="xacml:function:anyURI-equal">
      <AttributeValue DataType="xs:anyURI">webuser</AttributeValue>
      <SubjectAttributeDesignator
        AttributeId="xacml2:subject:role" DataType="xs:anyURI" />
    </SubjectMatch>
  </Subject></Subjects></Target>
  <Rule Effect="Permit" RuleId="apache:webuser:vhost">
    <Description>Add or remove a vhost</Description>
    <Target><Resources><Resource>
      <ResourceMatch MatchId="xacml:function:xpath-node-equal">
        <AttributeValue DataType="xacml2:data-type:xpath-expression">
          //pup:*[@type="apache::vhost"]
        </AttributeValue>
        <ResourceAttributeDesignator AttributeId="xacml:resource:resource-id"
          DataType="xacml2:data-type:xpath-expression" />
      </ResourceMatch>
    </Resource></Resources></Target>
  </Rule>
  <Rule Effect="Permit" RuleId="apache:webuser:vhost-docroot">
    <Target><Resources><Resource>
      <ResourceMatch MatchId="xacml:function:xpath-node-match">
        <AttributeValue DataType="xacml2:data-type:xpath-expression">
          //p:*[@type="apache::vhost"]/p:parameter[@param="docroot"]
        </AttributeValue>
        <ResourceAttributeDesignator AttributeId="xacml:resource:resource-id"
          DataType="xacml2:data-type:xpath-expression" />
      </ResourceMatch>
    </Resource></Resources></Target>
    <Condition><Apply FunctionId="thesis:function:string-starts-with">
      <Apply FunctionId="xacml:function:string-one-and-only">
        <AttributeSelector DataType="xs:string"
          RequestContextPath="//p:param[@param='docroot']/p:value/text()" />
      </Apply>
      <Apply FunctionId="xacml2:function:string-concatenate">
        <AttributeValue DataType="xs:string">/home/</AttributeValue>
        <Apply FunctionId="xacml:function:string-one-and-only">
          <SubjectAttributeDesignator DataType="xs:string"
            AttributeId="xacml:subject:subject-id" />
        </Apply>
      </Apply>
    </Apply></Condition>
  </Rule>
</Policy>


Figure 10: The XACML policy that only allows users from the webuser group to add vhosts with a document root in their home directory.


Our prototype is able to provide this by analysing the 
changes made to a configuration file and deriving the ac- 
tions from it that need to be authorised. The prototype is 
flexible enough to do this in a very fine-grained manner. 
Using the ACHEL method, changes to Puppet manifests
are authorised at the level of resource instantiation,
based on the parameters of the resources and the scope
they are declared in.

Second, our prototype is not yet fully finished. It is
possible to circumvent the authorisation using simple attacks.
This is not a limitation of our approach but of
the expressiveness of the XACML language, which results in a
policy that is not fully closed. A possible solution for
this type of attack is to extend the standard XACML
function space with domain-specific helper functions
so that fully closed access control policies can be written. In
our prototype we used the enterprise-java-xacml [12]
XACML engine. Adding functions to this engine is as easy
as adding an annotation to a Java class and to the method
that implements the XACML function.
7 Future work


We added basic support for our authorisation mechanism 
to Puppet. Future work in this direction should focus on 
extending support for the Puppet language and on inte- 
grating update workflows in the authorisation phase. 


Extending Puppet support Currently Puppet support 
is limited to a subset of the Puppet language. This subset 
provides a usable implementation, especially using the 
design patterns we described. Our prototype can be ex- 
tended to support additional language constructs such as 
calling functions or exporting resources. 


Integrating workflow In our original paper [10] we 
also integrated workflow support. This support is orthog- 
onal to extracting meaningful changes from changes in 
Puppet manifests. Therefore we did not implement this 
in this prototype. This workflow support is based on digi- 
tal signatures on revisions in the version repository. This 
information can be made available to the XACML en- 
gine through a PIP. This should be sufficient to add the 
workflow support to this prototype. 


Ownership information Our ACHEL method can 
also track ownership information based on the changes 
made to the configuration files. With this ownership
information, a policy could also state that person A cannot
change any parameters or resources that are owned
by person B. To derive ownership information we need 
to start from the first revision and determine for each 
change what the impact is on the ownership of a state- 
ment. For Puppet one of these questions is who is the 
owner of a resource? Is this the user that gave the name 
to the resource? If parameters are changed, how does the 
ownership of the resource and the parameter change? 


8 Conclusion


In this paper we developed a prototype that authorises 
changes to the Puppet input language based on their 
meaning. It derives the operations that need to be au- 
thorised from the changes to the input. This prototype 
extends our previous work by applying it to a real sys- 
tem configuration tool with a complex input language. 
These changes are authorised using XACML which is a 
widely adopted industry standard for access control and 
authorisation, instead of using regular expressions. 

In our previous work we applied our approach to a 
simple configuration language. The results from this 
work show that although the configuration language is 
more complex, it is possible to extract the individual
meaningful configuration changes. The complete language
is not yet fully supported, and the prototype needs
more work to analyse the difficult language constructs
before it can be used in production. Our claim from
our previous paper that the method is language agnos- 
tic holds, on the condition that the method can start from 
a clean AST. The usage of the XACML standard for
authorisation provides a lot of flexibility for writing
policies, as well as extensibility and integration of other
information sources. This prototype does not include the
workflow enforcement of our Achel [10] prototype, but
this can easily be added to the XACML engine by including
a PIP that provides the signature information required to
enforce workflows.

This implementation uses XACML as authorisation 
language. In our first prototype we used a regular ex- 
pression based language we developed. XACML pro- 
vides an authorisation engine that is the de facto industry 
standard in contrast to our own authorisation language. 
The usability of our regular expression based language 
was also very poor. Unfortunately it appears to be hard 
to write policies in XACML as well. Luckily tooling 
support exists for writing XACML policies to improve 
usability. 

To conclude, the main contributions of this paper are,
first, the identification of the difficulties and possibilities
of extracting meaningful changes from a complex
configuration language such as Puppet and, second, the
development of a set of rules that describe how to write a
policy and how to refer to the configuration elements in
these policies.
ABSTRACT


Authentication is of paramount importance for all modern networked applications. The 
username/password paradigm is ubiquitous. This paradigm suffices for many applications that 
require a relatively low level of assurance about the identity of the end user, but it quickly breaks 
down when a stronger assertion of the user’s identity is required. Traditionally, this is where two- 
or multi-factor authentication comes in, providing a higher level of assurance. There is a multitude 
of two-factor authentication solutions available, but we feel that many solutions do not meet the 
needs of our community. They are invariably expensive, difficult to roll out in heterogeneous user 
groups (like student populations), often closed source and closed technology and have usability 
problems that make them hard to use. In this paper we will give an overview of the two-factor au- 
thentication landscape and address the issues of closed versus open solutions. We will introduce a 
novel open standards-based authentication technology that we have developed and released in 
open source. We will then provide a classification of two-factor authentication technologies, and 
we will finish with an overview of future work. 
1 INTRODUCTION
1.1 AUTHENTICATION 


Authentication is something we all do every 
day. And whether it is to log in to our e-mail ac- 
count, to access our Facebook page or to tweet 
about that cool new album we've just bought, the 
use of username/password is by far the dominant 
paradigm. 


There are - of course - applications that re- 
quire a higher level of assurance such as electronic 
banking. The traditional approach for achieving this 
higher level of assurance is to use multi-factor au- 
thentication (also referred to as strong authentica- 
tion). There is a multitude of multi-factor authenti- 
cation solutions on the market. Traditionally, this 
market has a strong tendency towards closed solu- 
tions with strong vendor lock-in. This invariably 
leads to a high cost per user, hampering the wide- 
scale rollout of multi-factor authentication technol- 
ogies. Another common limitation of current multi- 
factor authentication technologies is the fact that 
they are often single-purpose solutions (e.g. they 
can only be used for one bank). Furthermore, there 
are serious usability issues with many multi-factor 
solutions that make it difficult to enforce their use in 
most communities. 


Exactly what an acceptable level of assurance 
is should not only be decided by a service provider; 
users may also have an opinion on this. Almost all 
solutions currently on the market give very little 
control to the end user. 


1.2 RECENT INDUSTRY DEVELOPMENTS 


In recent years there have been some promis- 
ing developments in the industry. In 2004, the Initi- 
ative for Open Authentication (OATH, [1]) was 
formed. The intention of this initiative is to create 
an industry-wide reference architecture for strong 
authentication. The OATH initiative has been very 
successful in creating industry standards for two- 
factor authentication that have been embraced by 
the Internet community in the form of IETF stand- 
ards ([2], [3], [4]). A number of companies and open 
source initiatives have adopted these standards in 
products and services (see also section 3). 


Other developments have underlined the need 
to adopt open standards in the security and authen- 
tication/identity management industry. Time and 
again closed solutions and algorithms have been 
shown to be vulnerable to attack because of the lack 
of peer review (poignant examples include the 
MIFARE system and the GSM A5/1 cryptographic 
algorithm [5]). 


Finally, large players on the Internet have re- 
cently introduced two-factor authentication for 
some of their services ([6], [7]). This is the first time 
two-factor authentication is deployed on a (poten- 
tially) large scale for applications outside the finan- 
cial industry or enterprise domain. 


1.3 OVERVIEW OF THIS PAPER 


In this paper we aim to give an overview of the 
current two-factor authentication landscape in sec- 
tion 2. In section 3, we will further clarify some of 
the issues we believe exist in current two-factor 
authentication market offerings. Section 4 proposes 
a way to classify authentication solutions and con- 
tains a classification of the solutions discussed in
section 2. In section 5, we will introduce the innova- 
tive two-factor authentication solution we have 
developed which is based on open standards and 
open technology. Section 6 revisits the classification 
proposed in section 4 and adds a classification for 
the solution we introduced in section 5. Finally, in 
section 7 we will draw conclusions and provide 
suggestions for future work. 


2 THE TWO-FACTOR LANDSCAPE 
2.1 INTRODUCTION 


In this section we aim to give an overview of 
the two-factor landscape. Before we do that, we will 
first give a definition of what we think constitutes 
two-factor authentication. 


We then describe the solutions currently on 
offer, which we divide into two categories: 


• Traditional solutions - these rely on single-purpose
(i.e. only used for identification)
hardware devices or on a unique quality of the
user (i.e. a biometric)

• Hybrid solutions - these rely on non-single-purpose
devices owned by the user, possibly
in combination with software running on these
devices


2.2 DEFINITION OF TWO-FACTOR AUTHENTICATION


In this paper, we define two-factor authentica- 
tion as a means of authentication relying on the user 
demonstrating at least 2 separate factors from the 
following list: 


• Something the user knows (e.g. a PIN code or a
password)

• Something the user has (e.g. a hardware token)

• Something the user is (e.g. a biometric, such as
a fingerprint)


Solutions that we place in the “hybrid” category rely 
on something the user has but where there is a 
chance of this factor being duplicated as could - for 
instance - be the case with a soft token running as 
an application on a smartphone. Some people in the 
blogosphere have coined the term “1.5 factor
authentication” for this category (e.g. [40]).


In this paper we will refer to a solution as two- 
factor authentication whenever the device on which 
the user is authenticating is physically separate 
from whatever constitutes the second factor (e.g. a 
soft token on a phone is only a second factor if it is 
used for authenticating a session on a separate de- 
vice such as the user’s computer). 
2.3 TRADITIONAL SOLUTIONS
2.3.1 OTP TOKENS 


One-Time Password or OTP tokens are devices 
that generate single-use passwords (often com- 
posed of strings of up to 10 digits). There are two 
variants: time-based tokens - these generate a new 
password at regular intervals (e.g. every 30 se- 
conds) and event-based tokens - these generate a 
new password after a user intervention (e.g. push- 
ing a button on the device). 


The second factor most often combined with 
these devices is either a password that is entered on 
the user’s computer or a PIN that is entered on the 
token device itself. 


OTP tokens rely on symmetric cryptography 
for their operation; they contain some secret that is 
securely stored in the device, which can never leave 
it. The same secret is also known on the server that 
validates the user’s credentials when they log in. 
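
To make the shared-secret principle concrete, the sketch below implements the openly specified HOTP (RFC 4226, event-based) and TOTP (RFC 6238, time-based) algorithms in Python. It is an illustration only: the tokens discussed here do not necessarily use these algorithms, the function names are chosen for this example, and the secret shown is the published RFC test secret rather than anything a real token would contain.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """Event-based one-time password (RFC 4226) using HMAC-SHA-1."""
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation offset
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238): HOTP over a time-step counter."""
    return hotp(secret, unix_time // step, digits)


# Token and server hold the same secret and derive the same password.
shared_secret = b"12345678901234567890"  # test secret from RFC 4226
token_password = hotp(shared_secret, counter=0)
server_password = hotp(shared_secret, counter=0)
print(token_password == server_password)
```

The server validates a login by computing the password for the expected counter or time step and comparing it with the value the user entered, which is why the secret must be known on both sides.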


Examples of OTP tokens include: VASCO 
Digipass [10], RSA SecurID [11] and Feitian OTP 
Tokens [12]. 



Figure 1 - example of an OTP token (RSA SecurID) 


2.3.2 CHALLENGE/RESPONSE TOKENS 


Challenge/response tokens are similar to OTP 
tokens in that they also rely on symmetric cryptog- 
raphy for their operation. Some OTP tokens also 
have challenge/response capabilities. 


Whereas OTP often suffices for simple authen- 
tication, challenge/response tokens are mainly used 
for transaction authentication such as, for instance, 
approving a money transfer. This is achieved by 
having the user enter one or more sequences of 
digits on the token (the challenge) and using these 
as input for a cryptographic algorithm to produce 
another sequence of digits (the response) that the 
user then returns to the party requesting authenti- 
cation. 


Challenge/response tokens are usually pro- 
tected using a PIN code as the second factor. 


Examples of challenge/response tokens in- 
clude VASCO DigiPass [13], SafeNet SafeWord GOLD 
[14] and Feitian Challenge/Response tokens [12]. 
Figure 2 - example of a challenge/response token
(SafeWord GOLD) 


2.3.3 PKI TOKENS 


In contrast to the previous two solutions, PKI 
tokens rely on public key cryptography. 


Under the hood, almost all PKI tokens rely on 
smart card ICs with a cryptographic co-processor 
capable of performing public key operations and -
in most cases - key generation. They come in a variety
of form factors, the two most common being the
smart card and the USB dongle. 


Authentication with PKI tokens usually relies 
on some form of challenge/response algorithm. The 
aim of these algorithms is to prove that the user is in 
possession of the private key belonging to a public 
key that is usually stored in an X.509 certificate (for 
more details, see [5] sections 3.2 and 4). 


Contrary to the previous two solutions, PKI 
tokens usually interface with the end user system. 
They rely on software running on that system to 
integrate with, for example, the browser and mail 
client. There is an exception to this rule: Mobile PKI 
(see [8]). In Mobile PKI, the user’s SIM card is used 
as a PKI token; interfacing with the token takes 
place using special SMS text messages. 


PKI tokens have a broader applicability than 
just authentication. They can also be used to create 
advanced - and in some jurisdictions legally binding
- digital signatures (for more information, see [5]
section 4.9). 


2.4 HYBRID SOLUTIONS 
2.4.1 SMS OTP 


For many years now, the fact that almost everyone
has a mobile phone has been used as a means
for two-factor authentication. Many users will be
familiar with SMS One-Time Passwords.


SMS OTP relies on an authentication server 
sending one-time passwords by SMS text message to 
the user. The user’s mobile phone is thus leveraged 
as an authentication factor. The other factor is 
commonly username/password (thus the user first 
logs in using username/password and then pro- 
vides additional proof of his or her identity using 
SMS OTP). 


There is some discussion about whether SMS
OTP constitutes real two-factor authentication
([15], [16]). A particular concern is that it is hard to
protect users against (temporary) theft of their
phone; a PIN lock almost never provides protection,
since SMS messages are displayed even
if the handset is locked.


There are many vendors of SMS OTP services; 
a Google search for “SMS OTP” will produce a long 
list. 


2.4.2 OTP APPS 


Another more recent development is the ap- 
pearance of One-Time Password Apps. These run on 
modern handsets (smart phones) and usually mimic 
the behaviour of OTP tokens (see section 2.3.1). The 
difference between these Apps and ‘real’ OTP tokens 
is that the secret is stored and processed in software 
on the handset. This makes them somewhat more 
vulnerable to attacks. 


Most OTP token vendors now also have App 
versions of their OTP tokens that interface with the 
same backend server systems that are also used for 
their hardware tokens. 


3 ISSUES IN TWO-FACTOR AUTHENTICATION


3.1 INTRODUCTION 


We feel that there are several issues surrounding
two-factor authentication that are hampering
rollout on a larger scale: most solutions are closed,
often use single-purpose tokens, are not easy to
use, may have prohibitive costs associated with
them and almost always lack user control. We will
address these issues in more detail in the remainder
of this section.


3.2 CLOSED SOLUTIONS 


The most important issue with most current
solutions is that they are closed ecosystems. For
example, the majority of OTP tokens are based on
proprietary algorithms and can only be integrated
into applications by using servers or server-side
components supplied by the token vendors.


Ironically, for PKI tokens it is even worse. 
They always require integration software on the 
client system in the form of cryptographic middle- 
ware (although they normally do not require server- 
side integration, since they are based on built-in 
X.509 client authentication). If the tokens are smart
cards, they require smart card readers (which are
not commonly installed in systems apart from some
enterprise-market laptops). Both smart card
readers and USB tokens may also require specific
drivers before they will work, although that is less
common nowadays since most of them support
the CCID [17] standard.


Because most solutions require proprietary
software, they are not easily integrated on all plat- 
forms (i.e. they will only work on vendor-supported 
platforms). 


For OTP tokens, the advent of the OATH initiative
brings hope, since both the algorithms in the
tokens and the way that the token secrets are
distributed are now specified in open standards.
This makes it possible to develop the server-side
integration software independently of the token
vendor and allows these components to support
tokens from many vendors. There is already quite a
bit of uptake among token vendors.


For PKI tokens, the situation is different.
Although there is an open source initiative
[21], this project has not seen wide use or
deployment, and most PKI token middleware
is still proprietary and closed. On a positive note,
PKI middleware at least adheres to the open PKCS
#11 standard [22].


One final thing to mention is Mobile PKI. From 
an integration perspective it is fully open, because it 
is based on an open standard web service interface 
called MSSP [23], [24], [25], [26]. The downside is 
that a special application needs to be installed on 
the user’s SIM card. The mobile operator owns the 
SIM card and access to it is strictly guarded. This
means that deploying Mobile PKI requires the
co-operation of the mobile operator, which has
proven difficult to obtain on many occasions.


3.3 SINGLE PURPOSE TOKENS 


Almost all OTP tokens are single-purpose to- 
kens by nature because they rely on a shared secret. 
The tokens themselves can only contain one secret, 
which means that they can only be paired with one 
server. Unless the server is used as an authentica- 
tion service for multiple applications (which is very 
rarely the case), the tokens can thus only be used for 
a single purpose (e.g. to log in to online banking for 
a single bank). This is very inconvenient for users, 
and indeed many users will know the hassle of hav- 
ing more than one token because they are custom- 
ers at more than one bank. 


In principle, PKI tokens should be more flexi- 
ble because they often support storage of more than 
one X.509 certificate together with the associated 
key-pair. Unfortunately, the issuance process of PKI 
tokens is usually such that users have no control 
over the content of their token and can very rarely 
add credentials for additional identities. Thus, PKI 
tokens can only be used for multiple purposes if 
they contain an identity issued by a Certificate Au- 
thority that is supported by the party to which the 
user is authenticating.


In theory, mobile App-based solutions can
more easily support multi-purpose deployments, but
in practice this does not happen very often yet.


3.4 (LACK OF) EASE OF USE 


Many users will have experienced how difficult
it can be to use OTP tokens. Most of them require
typing in complicated codes. The challenge/response
variety is even more complicated: users regularly
have to type multiple codes into the token and then
copy the result from the token by typing it into the
site they are authenticating to.


SMS OTP is no better. In fact, it is even more 
complicated in our opinion as the one-time pass- 
words used often consist of both capitals and lower 
case letters as well as digits and punctuation marks. 


PKI tokens fare a little better. As long as the 
software integration with the user’s browser is 
properly installed, the user experience is usually 
quite smooth. 


A common issue shared by all tokens except 
mobile phone-based ones is that they are all too 
easy to forget or lose. 


3.5 COST 


Both OTP and PKI tokens can be quite costly,
both in initial investment and in yearly licence
fees. It is not uncommon to pay tens of US dollars
per user per year. SMS OTP becomes gradually more
costly the more it is used.


The only exception to this rule is a new class of
OTP tokens that is emerging, based on open standards
developed by the OATH initiative. Because they
work with open source software, the only substantial
cost is the initial investment. Yubikey tokens
[27], for example, can be purchased for less than
USD 30 and the price goes down for larger volume
purchases.


3.6 (LACK OF) USER CONTROL 


Users seldom initiate deployment of two- 
factor authentication solutions. They are usually 
deployed by corporate IT departments or banks. 
The organisations deploying these tokens strictly 
control what they can or cannot be used for, severe- 
ly limiting users. 


It is very hard for users to acquire personal 
two-factor tokens and deploy them in a useful way 
because very few services provide the means to self- 
enrol identities. A notable exception to this is Google 
Authenticator [28].


4 CLASSIFICATION OF AUTHENTICATION SOLUTIONS


4.1 INTRODUCTION 


In this section we will introduce six different 
ways to classify authentication solutions in order to 
judge their suitability. We will use this classification 
at the end of this section to classify the two-factor 
authentication solutions discussed earlier. 


4.2 HARDWARE INDEPENDENCE 


The first way to classify authentication solu- 
tions is by their dependence (or lack thereof) on 
specific or specialised hardware for their operation. 


We feel that hardware independence enhances 
the usability of a solution, because the more inde- 
pendent a solution is from specific hardware, the 
fewer devices a user has to carry around. 


From a security perspective, however, using 
special purpose-made hardware has distinct ad- 
vantages. Devices can be tailored for one goal, which 
is to protect the secrets associated with a user’s 
credentials. 


In this paper, we will focus mainly on the en- 
hanced usability that comes with hardware inde- 
pendence; we will factor in the security advantages 
that special hardware can offer when we judge the 
security of a solution. We rank solutions that offer 
stronger hardware independence more favourably 
than solutions that require specific hardware to 
operate. 


4.3 SOFTWARE INDEPENDENCE 


Just like hardware independence, software in- 
dependence is mainly a usability enhancing aspect. 
In some cases, dependence on specific hardware 
goes hand in hand with dependence on specific 
software. For example, smart cards cannot operate 
without the accompanying security middleware that 
users will have to install on their computer. 


Some solutions only depend on specific soft- 
ware on the server side and do not require the user 
to install software (for example OTP tokens). 


We will judge solutions on the amount of effort
required to install software, both by end users and
by the system administrators of the server
side. We will also factor in the availability of
integration in off-the-shelf products, as this can
significantly reduce the installation effort.


4.4 SECURITY 


Security is - of course - one of the most im- 
portant factors when judging authentication solu- 
tions. 


There are several aspects that influence the 
security of a solution: 


° Is the solution a multi-factor solution? If so, is 
it a true multi-factor solution (see §2.3) or a 
hybrid solution (see §2.4)? 

° Does the solution rely on purpose-built hard- 
ware that has provisions for e.g. tamper re- 
sistance? 

° Are there well-known attacks that (severely) 
impact the security? 

° If the solution relies on cryptography, does it 
rely on sufficiently strong as well as open 
cryptography? 

° Has the security of the solution been verified 
by reputable independent security auditors? 


4.5 COST 


Cost is an important factor, especially for 
large-scale deployments. It can be considered from 
a number of different angles: 


° The one-time setup cost (e.g. in software and 
hardware purchases) and recurring cost of the 
actual solution (e.g. yearly licence fees). 

° The cost for troubleshooting for users who 
have misplaced their credentials or forgotten 
their password or PIN. 

° The cost of integrating the solution into exist- 
ing IT infrastructure (what skill level is re- 
quired and how much time do system adminis- 
trators or system integrators spend setting up 
the solution). 


4.6 OPEN STANDARDS COMPLIANCE 


Open standards form the backbone of the Internet.
Vendors implement these standards, which are
available free of charge or for a reasonable fee, to
guarantee interoperability with systems from other
vendors.


There is a whole host of open standards in the 
authentication arena that make it easier to integrate 
solutions into existing IT infrastructure. They also 
offer a certain level of vendor independence as one 
solution can be more easily exchanged for another. 
Of course, this also depends on the level to which 
open standards have been integrated. For example: 
OTP tokens that fully support the open standards of 
the Open Authentication Initiative can easily be in- 
tegrated with server-side software from a range of 
vendors that support these standards. On the other 
hand, PKI tokens that rely on PKCS #11 middleware 
are less easily replaced by another solution as they 
will require specific middleware supplied by the 
token vendor. 


For a long time supporting open standards 
was not common practice, especially among OTP 
token vendors. Fortunately, this is now changing for 
the better with the advent of consortia like the Open 
Authentication Initiative.


4.7 EASE-OF-USE


A final factor that can go a long way in deter- 
mining the success of a solution is ease-of-use. At 
first glance, solutions that are already familiar to a 
user - such as username/password - may seem 
easy-to-use. But when all the kludges that have been 
added to enhance the security such as complex 
password policies and requirements to change 
passwords on a regular basis are considered, it is 
easy to see that such solutions may not be as easy- 
to-use as initially assumed. 


Other things that need to be factored in when 
considering the ease-of-use of a solution are: 


° Does the solution require the user to carry
around additional devices (that he/she would
otherwise not need to operate their computer)?

° Does the user have to re-type complicated
codes (such as may be the case for OTP tokens)?

° Has care been taken to design the user experience
such that the solution can be used intuitively
by the user rather than requiring them
to learn how to operate the solution from e.g. a
manual?


4.8 CLASSIFICATION 


Table 1 below shows the scores we have as- 
signed to each solution described in section 2 for 
each of the 6 different classification categories de- 
scribed earlier; we used a five point scoring system 
ranging from ++ (indicating that a solution is (one 
of) the best in class for the given classification cate- 
gory) to -- (indicating that a solution has very unfa- 
vourable characteristics compared to other solu- 
tions in the given classification category). Any scor- 
ing system is, of course, subjective; we endeavour to 
justify the scores in Table 1 in section 4.9. 


[Table 1: one row per solution described in section 2;
columns: hardware independence, software independence,
security, cost, open standards compliance, ease-of-use.
The individual scores are not legible in this copy.]


Table 1 - Classification of authentication solutions 





4.9 JUSTIFICATION 


We would like to highlight certain points of the
classification we made in the previous section. Given
the endless stream of news articles about
username/password credentials being compromised, we
feel that - even though it is tried and tested - this
paradigm is really lacking in security. And even if
organisations enforce secure password policies and
users adhere to them, they may still be at risk. Recent
developments in password cracking, such as the use of
GPU-based cracking systems, make the security of
any password under a certain length questionable
[45]. With the increasing value that online identities
have (how would you feel if your Gmail, your Facebook
or your Twitter account got compromised and
someone read your private data or tried to impersonate
you?) we, as authors, are of the opinion that
two-factor authentication should become much
more common than it is now.


As the classification shows, to get rock solid 
security using two-factor authentication we feel that 
a real purpose-built hardware token should be used. 
Nevertheless, emerging solutions that rely on mo- 
bile phones as personal devices, such as OTP Apps, 
show great promise. If implemented properly, these 
solutions can add significant value security-wise. 


There are three key problems currently inhibiting
wide-scale deployment of two-factor authentication
outside of the corporate and banking environment.
The first is cost; OTP and PKI tokens are
expensive (there are exceptions: interestingly, one
of the largest deployments of OTP tokens is for
online World-of-Warcraft [41]). The second is
dependence on bespoke hardware and software. PKI
tokens in particular suffer from the problem that
they require the end-user to install driver software
and security middleware that is not always available
for all end-user platforms.


The third problem is the lack of adherence
to open standards. This not only stops people from
integrating support for two-factor authentication
into their online services, it also means that many
two-factor products are single purpose only (e.g. a
token issued by a bank cannot be used to authenticate
for other services).


We have tried to reflect these three problems
in the classification given in Table 1.


5 tiqr: EXAMPLE OF AN OPEN APPROACH


5.1 INTRODUCTION 


In 2009 we experimented with Mobile PKI 
(see also §2.3.3) as a means of authentication. As the 
report [8] of our experiment shows, we were very 
happy with the results. The technology is user- 
friendly, very secure and - because of the open 
standards it is based on - easy to integrate. 


The only major hurdle we encountered is the 
dependence on mobile operators. These operators 
are very hesitant about deploying the technology 
because it requires a SIM swap (most SIMs deployed 
in The Netherlands are not PKI capable), and be- 
cause they do not feel that there is a strong business 
case to deploy the technology in terms of potential 
revenue from it.


As operator of the National Research and Education
Network (NREN) in The Netherlands,
SURFnet operates a so-called identity federation 
(see [29]) called SURFfederatie. This federation ena- 
bles users to log in at a multitude of online service 
providers using a single identity hosted by their 
home institution. Furthermore, this federation of- 
fers users single sign-on. 


As is the case on most of the Internet, almost 
all authentications in the SURFfederatie rely on the 
tried and tested username/password mechanism. 
We would like to improve on this situation by intro- 
ducing alternative means of authentication based on 
two-factor authentication technology. There are two 
reasons for this: first, we feel that some services 
require a stronger form of authentication than 
username/password. Secondly, we would like to 
offer users a safe alternative that they can use on 
untrusted systems such as, for instance, computers 
in Internet cafés. 


SURFfederatie has a sizable and very heterogeneous
user population consisting of approximately
one million students, researchers and other staff
from over 160 different institutions. It would be
impossible to deploy a token-based two-factor au- 
thentication solution because of the logistics in- 
volved. It would, however, be ideal if we could de- 
ploy a secure two-factor authentication system that 
uses mobile phones. Almost everyone owns a mo- 
bile phone (in fact, in The Netherlands, a country of 
16.5 million people, there are over 19 million active 
mobile subscriptions [30]) and users are very moti- 
vated to carry their mobile phone at all times [31]. 


For reasons mentioned before, we could not 
rely on Mobile PKI so we started searching for an 
alternative. The criteria for this alternative were 
that it should be secure, user-friendly, easy to de- 
ploy, open and suitable for managing multiple iden- 
tities. We believe that we have developed a novel 
solution that meets all of these criteria. 


5.2 THE CONCEPT 
5.2.1 BASIC FEATURES USED 


The concept we call tiqr is based on three fea- 
tures of modern smartphones: 


° The ability to run Apps 
° A camera 
° Internet connectivity


5.2.2 QR CODES 


Relying on these smartphone features allows 
tiqr to make use of two-dimensional barcodes called 
QR codes. They were invented by Toyota subsidiary 
Denso-Wave in the 1990s. 


1. Version information
2. Format information
3. Data and error correction keys
4. Required patterns
   4.1. Position
   4.2. Alignment
   4.3. Timing
5. Quiet zone


Figure 3 - a QR code with specific features highlighted 
(source [9]) 


Although patented, QR codes can be used roy- 
alty free. The technology behind the codes has been 
standardised as ISO/IEC 18004:2006. Up to 4KB of 
alphanumeric data can be stored in the codes and 
numerous libraries are available that can extract 
information contained in a QR code from images 
captured by a camera. For more details about QR 
codes, we refer readers to the excellent Wikipedia 
article [9]. 


QR codes have become quite popular, because 
most phones are equipped with cameras and can 
run QR code reader software. The codes are almost 
exclusively used in a static fashion, for instance in 
advertising or on public transport stops. They usual- 
ly contain an encoded URL that QR code readers can 
open in a mobile browser. 


The innovation we have come up with is to use 
QR codes in a dynamic rather than a static fashion. 
By encoding a challenge in a dynamically generated 
QR code that is displayed to the user when he/she 
wants to log in, we use QR codes to take away the 
burden on users of typing challenge/response 
codes. QR codes are also used during enrolment to 
tie the user’s phone to an identity. Although this 
solution is not unique - the Google Authenticator 
App [28] can use a QR code to convey the user se- 
cret during enrolment - we have taken this technol- 
ogy further by creating a seamless user experience. 


5.2.3 THE TIQR USER EXPERIENCE 


To illustrate how tiqr works, we will go 
through the tiqr user experience during authentica- 
tion (assume for now that a user already has a tiqr- 
enabled account). 


The flow starts with a user browsing to a website
that requires them to log in. Where most sites would
display a username/password dialog (or an entry
field to enter a one-time password), with tiqr users
will see a QR code as shown in Figure 4.


Figure 4 - tiqr login page showing a QR code


Contained in the QR code is a challenge. The
user now launches the tiqr App on their
smartphone. The App will activate the camera
allowing the user to scan the QR code.


Figure 5 - the user scans the QR code with the tiqr App


Apart from a random challenge, the QR code 
also contains information on the relying party re- 
questing authentication. The App can manage mul- 
tiple identities and will select an appropriate identi- 
ty that can be used to log in to this particular site (if 
multiple identities are present, the user will see a 
list and can choose the appropriate one). The tiqr 
App now asks the user to confirm that he/she wants 
to log in, also displaying the domain name of the site 
they are logging in to in order to reduce the risk of 
phishing.


Figure 6 - tiqr asks for user confirmation


Once the user has confirmed their identity,
they will be asked to enter their PIN code (the
second factor).


Figure 7 - user entering their PIN


The user is helped in remembering his or her
PIN by means of animal icons displayed in the PIN
entry dialog. Errors made during PIN entry (such as
swapping two digits or entering a completely
different PIN) will lead to a different sequence being
displayed. When the user presses OK, login will
proceed. If the user’s phone is online, the Internet
connection of the phone will be used to submit the
response to the authenticating server, thus obviating
the need to type one-time passwords into a website.
When authentication is successful, the user is notified
both on the phone and by the website, which proceeds
with login by redirecting the user to the protected
content as shown in the screenshot (Figure 8).


Figure 8 - the user has successfully logged in


In case no Internet connection is available on 
the phone a fall-back scenario is used where a one- 
time password is displayed on the phone for the 
user to type into the website (more on this in sec- 
tion 5.8). 


5.2.4 FROM PROOF-OF-CONCEPT TO PRODUCT 


We first came up with the concept that led to
the development of tiqr in September 2010. In order
to prove that the concept would work, we designed
the initial protocol and developed a proof-of-concept
implementation of both the server side and
the phone side. For the proof-of-concept,
an implementation was created for Apple’s iOS
platform.


The proof-of-concept quickly showed that the 
technology worked very well. We first demonstrat- 
ed the working proof-of-concept at an event held 
every two years to showcase SURFnet innovations 
to our connected institutions in December 2010 and 
received helpful and positive feedback from the 
people attending. This led us to decide that we 
should continue development. 


In April 2011 we released the first Apple iOS 
production version in the Apple App Store and we 
presented on the project at the Internet2 Spring 
Member Meeting in Arlington, VA. The Android
version was released in May 2011, just before we
presented on further improvements to tiqr at the
TERENA Networking Conference 2011 in Prague,
Czech Republic.


The remainder of this section will go into more 
detail about the tiqr technology. 


5.3 MOBILE APPS 
5.3.1 PLATFORMS 


We wanted to make tiqr available on the two
most common smart phone platforms. According to
a Q1 2011 market survey, those platforms are
Apple’s iOS and Google’s Android platform:


Figure 9 - smart phone market, source: The Guardian/Kantar


We have developed Apps for both these plat- 
forms. The Apps rely on the excellent ZXing QR code 
library developed by Google (see [18]) for QR code 
detection and decoding. The Apps implement the 
tiqr challenge/response protocol, which is based on 
OCRA/HOTP [19], [2]; more information on the pro- 
tocol can be found in section 5.5. 
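
To give a feel for the kind of computation involved, the sketch below implements plain HOTP as specified in RFC 4226: an HMAC-SHA-1 over an 8-byte counter followed by dynamic truncation. This is not the tiqr protocol itself. OCRA extends this construction with a challenge and further inputs, which are not shown here, and the function name hotp is chosen for this example only.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """Compute an HOTP value (RFC 4226) for a shared secret and counter."""
    # HMAC-SHA-1 over the counter encoded as an 8-byte big-endian integer.
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low nibble of the last byte selects an offset.
    offset = mac[-1] & 0x0F
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    # Reduce to the requested number of decimal digits, zero-padded.
    return str(value % 10 ** digits).zfill(digits)
```

With the test secret from RFC 4226 (the ASCII string 12345678901234567890), counter 0 yields 755224 and counter 1 yields 287082. Both sides can compute the same value only if they hold the same secret, which is why the secret has to be protected on the phone as well as on the server.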


5.3.2 APP SECURITY CONSIDERATIONS 


The tiqr protocol relies on shared secrets for
the challenge/response implementation. The secret
is stored both on the phone and on the server.


We can only reasonably assume that the phone 
with the App and the secret on it is a secure authen- 
tication factor if it is hard for an attacker to gain 
access to the actual secret. We therefore protect the 
secrets belonging to user identities by encrypting 
the secrets using PKCS #5 password-based encryp- 
tion [20]. The basis for encryption is the 4-digit PIN 
code the user chooses for the identity. 


Of course there are only 10000 possible PIN 
codes with a 4-digit PIN. We assume that it is easy 
for a motivated attacker to gain access to the en- 
crypted secret so we need to protect it against 
brute-force attacks. We achieve this by applying two 
principles. Firstly, the encrypted secret contains no 
internal structure (i.e. only the secret key - which is 
assumed to be truly random - is encrypted, there is 
no formatting around the key data before it is en- 
crypted). This automatically leads to a second level 
of protection: because the encrypted key has no 
structure around it, it is impossible to check if the
correct PIN was used to decrypt the secret since the
decrypted data will look like random data in all cas- 
es. As a result of this, only the server can check if the 
correct PIN was entered because the computed re- 
sponse is only valid if the correct secret key was 
used. 
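
The effect of leaving out any structure can be illustrated with a small sketch. It derives a key stream from the PIN with PBKDF2 (the key derivation function from PKCS #5 v2) and combines it with the secret using XOR. The XOR step is only a stand-in for a real cipher, and the names seal and unseal, the iteration count, the hash function and the salt handling are all assumptions made for this example; the text above does not specify these details.

```python
import hashlib

ITERATIONS = 10_000  # assumed value; not taken from the paper


def _key_stream(pin: str, salt: bytes, length: int) -> bytes:
    # PBKDF2 stretches the short PIN into `length` pseudo-random bytes.
    return hashlib.pbkdf2_hmac("sha256", pin.encode(), salt, ITERATIONS, dklen=length)


def seal(secret: bytes, pin: str, salt: bytes) -> bytes:
    """Protect the raw secret; no header, padding or checksum is added."""
    stream = _key_stream(pin, salt, len(secret))
    return bytes(a ^ b for a, b in zip(secret, stream))


def unseal(blob: bytes, pin: str, salt: bytes) -> bytes:
    """Recover a candidate secret; every PIN yields some byte string."""
    return seal(blob, pin, salt)  # XOR is its own inverse
```

Unsealing with a wrong PIN does not raise an error. It simply returns a different byte string of the same length, so an attacker holding only the stored blob has nothing to test the 10000 candidate PINs against.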


To prevent online attacks, we recommend that 
the server block an account after a pre-set number 
of failed authentication attempts (in fact our demo 
implementation blocks an account after 3 failed 
attempts). Depending on the desired security level, 
the server administrator may also decide to imple- 
ment some form of exponential back off mechanism 
to mitigate brute-force attacks. In this scenario, ac- 
counts are temporarily blocked after a failed login 
attempt. This thwarts brute-force attacks but is also 
more user friendly for legitimate users since enter- 
ing the wrong PIN more than a certain number of 
times will not immediately lead to a blocked ac- 
count. 
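
The two policies described above can be sketched as follows. The limit of 3 attempts is the one used in our demo implementation; the class and function names and the 2-second base delay are assumptions made for this illustration only.

```python
from dataclasses import dataclass

MAX_FAILURES = 3  # the demo implementation blocks after 3 failed attempts
BASE_DELAY = 2.0  # assumed initial back-off in seconds


@dataclass
class AccountState:
    failures: int = 0
    locked_until: float = 0.0
    blocked: bool = False


def register_failure_hard(state: AccountState) -> None:
    """Policy 1: block the account permanently after MAX_FAILURES failures."""
    state.failures += 1
    if state.failures >= MAX_FAILURES:
        state.blocked = True


def register_failure_backoff(state: AccountState, now: float) -> None:
    """Policy 2: lock temporarily, doubling the delay with every failure."""
    state.failures += 1
    state.locked_until = now + BASE_DELAY * 2 ** (state.failures - 1)


def may_attempt(state: AccountState, now: float) -> bool:
    """An attempt is allowed if the account is neither blocked nor locked."""
    return not state.blocked and now >= state.locked_until
```

With the back-off policy the waiting time grows as 2, 4, 8, ... seconds, so trying all 10000 PINs online becomes impractical while a legitimate user who mistypes once only waits briefly. A successful login would normally reset the failure counter.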


5.3.3 APP USER EXPERIENCE 


One of our main goals was to create an easy- 
to-use system. We have taken special care to ensure 
that the user experience of the App is as straight- 
forward, self-explanatory and smooth as possible. 
The prototype developed for the proof-of-concept 
was handed over to user-interface designers. They 
studied the concept and the prototype implementa- 
tion. Using storyboards, they designed an optimised 
user workflow. The main focus of the workflow is to 
make it self-evident to the user what the next logical 
step is going to be. Another change they introduced 
was to do away with a separate enrolment workflow 
(in the prototype, we had two completely separate 
workflows for enrolment and authentication).
Instead, the user just scans the QR code that is shown
and information in the code determines whether an
authentication or an enrolment workflow is going to
be followed.


Another design decision that was made was to
try to steer users toward using the same PIN code
for all the identities managed by the tiqr App. The
reasoning behind this is that the user-interface
designers feel that it is counter-intuitive for most
users to have multiple PIN codes in a single
application. On the one hand, this concept is very
subtly integrated in the user experience by using
suggestive wording (i.e. when enrolling a new identity,
the App uses the text “Please enter your PIN” when
users have to choose a new PIN for the identity rather
than “Please choose a new PIN”). On the other hand, it
has also been taken quite far in that if a single
identity becomes blocked due to entering the wrong PIN
too many times, the App will block all identities it
manages. We have not had sufficient user feedback
to be able to decide whether or not this was a good
choice; so far, we have had some feedback from
third party developers that they feel this to be a bad
choice. We hope to learn more in a pilot implementation
that we are planning for the fall of 2011.


One final thing to note about the user experi- 
ence is that we integrated an aide-memoire into the 
PIN entry dialog. We use icons with animal shapes 
to help the user remember their PIN (as shown in 
the figure below). 


[Screenshot: “Enter your PIN” dialog with the prompt
“You need a PIN code for this account. If you don't yet
have a PIN code for tiqr please choose one.”]


Figure 10 - PIN entry showing animal reminders 





If the user enters the correct PIN, the same 
four animal icons should show up in the PIN entry 
field. Users can either remember the whole se- 
quence or elect to remember just the last icon. To 
ensure that the sequence changes when common 
PIN entry mistakes are made (such as swapping two 
digits) we use the Verhoeff checksum algorithm [32] 
for error detection. 
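
For readers unfamiliar with the Verhoeff scheme, the following sketch computes and validates a Verhoeff check digit using the standard published tables. How the tiqr App derives the icon sequence from the checksum is not described here, so the sketch stops at the check digit; the function names are chosen for this example.

```python
# Multiplication table of the dihedral group D5.
D = (
    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    (1, 2, 3, 4, 0, 6, 7, 8, 9, 5),
    (2, 3, 4, 0, 1, 7, 8, 9, 5, 6),
    (3, 4, 0, 1, 2, 8, 9, 5, 6, 7),
    (4, 0, 1, 2, 3, 9, 5, 6, 7, 8),
    (5, 9, 8, 7, 6, 0, 4, 3, 2, 1),
    (6, 5, 9, 8, 7, 1, 0, 4, 3, 2),
    (7, 6, 5, 9, 8, 2, 1, 0, 4, 3),
    (8, 7, 6, 5, 9, 3, 2, 1, 0, 4),
    (9, 8, 7, 6, 5, 4, 3, 2, 1, 0),
)
# Position-dependent permutation table (period 8).
P = (
    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    (1, 5, 7, 6, 2, 8, 3, 0, 9, 4),
    (5, 8, 0, 3, 7, 9, 6, 1, 4, 2),
    (8, 9, 1, 6, 0, 4, 3, 5, 2, 7),
    (9, 4, 5, 3, 1, 2, 6, 8, 7, 0),
    (4, 2, 8, 6, 5, 7, 3, 9, 0, 1),
    (2, 7, 9, 3, 8, 0, 6, 4, 1, 5),
    (7, 0, 4, 6, 9, 1, 3, 2, 5, 8),
)
# Inverse of each element in D5.
INV = (0, 4, 3, 2, 1, 5, 6, 7, 8, 9)


def verhoeff_check_digit(digits: str) -> int:
    """Return the Verhoeff check digit for a string of decimal digits."""
    c = 0
    for i, ch in enumerate(reversed(digits)):
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return INV[c]


def verhoeff_is_valid(digits_with_check: str) -> bool:
    """Check a digit string whose last digit is a Verhoeff check digit."""
    c = 0
    for i, ch in enumerate(reversed(digits_with_check)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0
```

For example, verhoeff_check_digit("236") returns 3. Because the scheme detects every swap of two adjacent, different digits, the PINs 1234 and 1243 are guaranteed to produce different check digits, which is the property that makes the displayed sequence change on such a mistake.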


5.4 SERVER SIDE 
5.4.1 REQUIREMENTS 


As was already mentioned in the previous section,
the basis for tiqr is challenge/response
authentication using shared secrets (more information on
the protocol can be found in section 5.5). This
means that the secret key information that is present
on the phone also needs to be stored on the
server.


This, of course, puts certain requirements on
the server implementation. User secrets should be
stored encrypted, either on disk in a database or in a
Hardware Security Module (HSM).


Another thing that is required on the server
side is a library that generates the QR codes used to
convey the challenge to the user. There are good
open source implementations available for most
common web application platforms. For our reference
implementation we use PHP QR Code [33].


The most important thing to pay attention to 
on the server side is that the protocol is implement- 
ed correctly. We provide a reference implementa- 
tion to show how the protocol works (which is dis- 
cussed in the next section). 


5.4.2 REFERENCE IMPLEMENTATION 


To give developers a head start at integrating
tiqr into their application, we have developed a
reference implementation in PHP. This reference
implementation shows how the tiqr protocol works
(see section 5.5).


We did not put any security provisions in the 
reference implementation so it should not be used 
in production. We are considering creating a more 
secure implementation that people can deploy 
straightaway. 


5.4.3 SIMPLESAMLPHP MODULE 


As outlined in section 5.1, we plan to use tiqr
in an identity federation. To show this concept in
action, we have developed a plug-in module for the
popular SimpleSAMLphp [34] identity management
suite.


Our demo portal (https://tiqr.org/demo/) us- 
es this implementation. 


It is currently based on our reference imple- 
mentation so it is not sufficiently secure yet for pro- 
duction use. We are collaborating with the Sim- 
pleSAMLphp team to create a production-ready 
version, which we hope to release in the autumn of 
2011. 


5.5 PROTOCOL
5.5.1 GENERAL 


We rely solely on open standards and open 
specifications as a basis for the tiqr protocol. The 
following standards are used: 


• OCRA [19] - this is the suite of one-time password
algorithms used for tiqr challenge/response

• JSON [35] - this is the object notation used to
exchange data in the tiqr protocol

• HTTP over TLS - used to transport information
exchanges securely

• QR codes [9]


5.5.2 ENROLMENT 


Enrolment starts with a QR code that is dis- 
played to the user. This QR code contains a URL 
with the following schema: 


tiqrenroll://<url>


Where <url> must be a valid HTTPS URL that 
points to a location where the details for the enrol- 
ment request can be retrieved, for example: 


tiqrenroll://https://demo.tiqr.org/enroll/details?session=082176122169132630
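
As a minimal sketch of how a client might unpack such a QR payload, the following Python function strips the custom scheme and checks that what remains is an HTTPS URL. The function name and error messages are illustrative assumptions and are not taken from the tiqr code base.

```python
from urllib.parse import urlsplit

ENROLL_SCHEME = "tiqrenroll://"


def parse_enroll_url(qr_text: str) -> str:
    """Return the HTTPS details URL carried in a tiqrenroll:// QR payload."""
    if not qr_text.startswith(ENROLL_SCHEME):
        raise ValueError("not a tiqr enrolment URL")
    details_url = qr_text[len(ENROLL_SCHEME):]
    parts = urlsplit(details_url)
    # The embedded URL must be a valid HTTPS URL; anything else is rejected.
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError("details URL must be a valid HTTPS URL")
    return details_url
```

Rejecting non-HTTPS URLs at this point keeps the later retrieval of enrolment details on a TLS-protected channel.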


The tiqr App will contact this URL to retrieve a
JSON object with enrolment details. This object has
the following syntax:


{
    "service": {
        "identifier": <id>,
        "displayName": <name>,
        "logoUrl": <logo-url>,
        "infoUrl": <info-url>,
        "authenticationUrl": <auth-url>,
        "ocraSuite": <OCRA-suite>,
        "enrollmentUrl": <enroll-url>
    },
    "identity": {
        "identifier": <id>,
        "displayName": <fullName>
    }
}


The service section of the object identifies the 
service to which the user is enrolling. The identity 
section provides details about the identity that is 
being enrolled. The fields in both sections of this 
object have the following semantics: 


• Service section
  ◦ identifier - should contain a reversed domain
    name (e.g. org.tiqr.demo)
  ◦ displayName - should contain the name of
    the service
  ◦ logoUrl - should contain a valid URL to a
    service logo; we recommend a PNG24 image
  ◦ infoUrl - a URL linking to a webpage with
    more information about the identity provider;
    this link is displayed on the “detailed
    information page” for the identity
  ◦ authenticationUrl - should contain the URL
    for the authentication handler for this service
  ◦ ocraSuite - the OCRA suite the server requires;
    the App uses this to determine the
    appropriate OCRA parameters (see [19],
    section 6)*
  ◦ enrollmentUrl - should contain the URL for
    the one-time enrolment handler

• Identity section
  ◦ identifier - should contain a unique user
    identifier used to identify the account
  ◦ displayName - should contain the full
    name of the user
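
The structure described above can be checked mechanically before the App acts on it. The sketch below parses the JSON and verifies that both sections and their fields are present. Treating every listed field as mandatory is an assumption made for illustration, as is the function name.

```python
import json

SERVICE_FIELDS = ("identifier", "displayName", "logoUrl", "infoUrl",
                  "authenticationUrl", "ocraSuite", "enrollmentUrl")
IDENTITY_FIELDS = ("identifier", "displayName")


def parse_enrolment_details(text: str) -> dict:
    """Parse the enrolment JSON object and check that expected fields exist."""
    details = json.loads(text)
    if not isinstance(details, dict):
        raise ValueError("enrolment details must be a JSON object")
    for section, fields in (("service", SERVICE_FIELDS),
                            ("identity", IDENTITY_FIELDS)):
        content = details.get(section)
        if not isinstance(content, dict):
            raise ValueError(f"missing section: {section}")
        # Assumption: every field listed in the paper is treated as required.
        missing = [name for name in fields if name not in content]
        if missing:
            raise ValueError(f"{section} is missing: {', '.join(missing)}")
    return details
```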


*An example OCRA suite as specified by the 
server could for instance be: 


OCRA-1:HOTP-SHA1-6:QH10-S


This OCRA suite specification breaks down as 
follows: 


• OCRA-1 - the OCRA algorithm version (in this
case version 1, the current version)

• HOTP-SHA1-6 - the cryptographic function to
use (in this case HMAC OTP [2], with SHA-1 as
hash algorithm using dynamic truncation to a
6-digit value); the tiqr App supports all algorithms
specified in the OCRA standard

• QH10-S - the input for the challenge (in this
case a 10-digit hexadecimal value represented
as a String) and the size of the session data (in
this case the default value of 64 bytes)
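
The breakdown above can be expressed as a small parser. This is a sketch only: it recognises the challenge (Q) and session (S) fields used in the example suite and ignores the other data inputs OCRA allows (counter, PIN hash, timestamp). The class and function names are illustrative assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OcraSuite:
    version: int
    hash_name: str
    digits: int
    challenge_format: str           # "A", "N" or "H"
    challenge_length: int
    session_length: Optional[int]   # None when the suite has no S field


def parse_ocra_suite(suite: str) -> OcraSuite:
    """Split an OCRA suite string into its version, function and inputs."""
    try:
        algorithm, crypto, data_input = suite.split(":")
    except ValueError:
        raise ValueError("expected three colon-separated parts") from None
    version = re.fullmatch(r"OCRA-(\d+)", algorithm)
    function = re.fullmatch(r"HOTP-(SHA1|SHA256|SHA512)-(\d+)", crypto)
    if not version or not function:
        raise ValueError("unrecognised algorithm or crypto function")
    challenge = None
    session_length = None
    for field in data_input.split("-"):
        q = re.fullmatch(r"Q([ANH])(\d{2})", field)
        s = re.fullmatch(r"S(\d{3})?", field)
        if q:
            challenge = (q.group(1), int(q.group(2)))
        elif s:
            # A bare "S" means the default session data size of 64 bytes.
            session_length = int(s.group(1)) if s.group(1) else 64
    if challenge is None:
        raise ValueError("suite has no challenge (Q) field")
    return OcraSuite(int(version.group(1)), function.group(1),
                     int(function.group(2)), challenge[0], challenge[1],
                     session_length)
```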


For more examples see [19].

When the user confirms enrolment, a new
HTTPS connection is made to the enrolment server
URL specified in the enrollmentUrl property of the
JSON object. A POST request is sent across this
connection. This POST contains the following parameters:


• secret - this is the shared secret; the secret is
generated by the App on the phone; we currently
use 256-bit AES keys as secrets

• notificationType - optional; this is the notification
type used to send push messages to the
App and can be set to either APNS (for Apple
Push Notification Service) or C2DM (for Android
push notifications)

• notificationAddress - optional; notification-protocol
specific address to which push notifications
can be sent

• language - contains the user interface language
of the user; this information may be
used to display appropriate error messages in
the user’s preferred language


If enrolment is successful, the server will re- 
turn the string OK (with no white space before or 
after the string). When an error occurs, the normal 
HTTP error procedure is followed to return the er- 
ror to the App. 
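
The enrolment exchange can be sketched from the client side as follows. The paper does not state how the secret is encoded or how the POST body is serialised, so the hex encoding and form encoding used here are assumptions, as are the function names.

```python
from typing import Optional
from urllib.parse import urlencode

NOTIFICATION_TYPES = ("APNS", "C2DM")


def build_enrolment_body(secret: bytes, language: str,
                         notification_type: Optional[str] = None,
                         notification_address: Optional[str] = None) -> bytes:
    """Build a form-encoded enrolment POST body (encoding is an assumption)."""
    params = {"secret": secret.hex(), "language": language}
    if notification_type is not None:
        if notification_type not in NOTIFICATION_TYPES:
            raise ValueError("notificationType must be APNS or C2DM")
        params["notificationType"] = notification_type
    if notification_address is not None:
        params["notificationAddress"] = notification_address
    return urlencode(params).encode("ascii")


def enrolment_succeeded(response_body: str) -> bool:
    """Success is the exact string OK, with no surrounding white space."""
    return response_body == "OK"
```

Comparing against the exact string, rather than stripping white space first, follows the wording of the protocol description.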


5.5.3 AUTHENTICATION 


Authentication starts by displaying a QR code 
to the user. This QR code contains a URL encoded 
according to the following URL schema: 


tiqrauth://[<identityIdentifier>@]
<serviceIdentifier>/
<sessionKey>/
<challenge>[?<returnUrl>]


The fields in this URL have the following se- 
mantics: 


• identityIdentifier - optional field specifying the
user identity to use for authentication; may be
used in a so-called step-up authentication scenario
where the user has already logged in using
another means of authentication

• serviceIdentifier - the service identifier as
specified during enrolment (the service domain
name in reverse domain notation, e.g.
org.tiqr.demo)

• sessionKey - session key for this authentication
request; links the response to the active user
session when submitted

• challenge - the authentication challenge; the
size of the challenge depends on the OCRA
suite as specified during enrolment

• returnUrl - optional field specifying the URL to
return the user to after successful authentication;
this URL is only used if the session originated
from the mobile browser on the device
containing the tiqr App
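
A sketch of how the fields above might be extracted from the QR payload is given below. The type and function names are illustrative assumptions, and the return URL is passed through unchanged because the paper does not say whether it is percent-encoded.

```python
from typing import NamedTuple, Optional

AUTH_SCHEME = "tiqrauth://"


class AuthRequest(NamedTuple):
    identity: Optional[str]
    service: str
    session_key: str
    challenge: str
    return_url: Optional[str]


def parse_auth_url(qr_text: str) -> AuthRequest:
    """Split a tiqrauth:// URL into the fields described in the paper."""
    if not qr_text.startswith(AUTH_SCHEME):
        raise ValueError("not a tiqr authentication URL")
    rest, _, return_url = qr_text[len(AUTH_SCHEME):].partition("?")
    parts = rest.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected <service>/<sessionKey>/<challenge>")
    authority, session_key, challenge = parts
    # The identity prefix is optional; rpartition yields "" when it is absent.
    identity, _, service = authority.rpartition("@")
    if not service:
        raise ValueError("missing service identifier")
    return AuthRequest(identity or None, service, session_key, challenge,
                       return_url or None)
```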


The tiqr App will compute the response to the
challenge using the algorithm that was specified
during enrolment. It will submit the response by
setting up an HTTPS connection to the authentication
endpoint specified during enrolment. The submission
is done using a POST with the following parameters:


• sessionKey - the session key received in the QR
code identifying the user session that requires
authentication

• userId - the user identifier of the user attempting
to log in

• response - the response computed to the challenge
specified in the QR code

• language - the user’s preferred language; this
information is used to display error messages
in an appropriate language
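
To illustrate how the response value could be derived, the sketch below follows the data layout of the OCRA specification [19] as we understand it for suites of the form used in the example (a hexadecimal challenge plus 64 bytes of session data, HMAC-SHA-1, dynamic truncation). It is not the tiqr App's code, the function name is an assumption, and it should be checked against the published OCRA test vectors before any real use.

```python
import hashlib
import hmac


def ocra_response(secret: bytes, suite: str, challenge_hex: str,
                  session_info: bytes = b"", digits: int = 6) -> str:
    """Compute an OCRA-style response for a Q + S suite using HMAC-SHA-1."""
    if len(challenge_hex) > 256:
        raise ValueError("challenge longer than 128 bytes")
    if len(session_info) > 64:
        raise ValueError("session information longer than 64 bytes")
    # Challenge: hex string right-padded with zeros to 128 bytes.
    question = bytes.fromhex(challenge_hex.ljust(256, "0"))
    # Session information: left-padded with zero bytes to 64 bytes.
    session = session_info.rjust(64, b"\x00")
    # Message layout: suite string, a zero separator byte, then the inputs.
    message = suite.encode("ascii") + b"\x00" + question + session
    digest = hmac.new(secret, message, hashlib.sha1).digest()
    # Dynamic truncation as in HOTP: take 31 bits at a digest-derived offset.
    offset = digest[-1] & 0x0F
    code = int.from_bytes(digest[offset:offset + 4], "big") & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

The server performs the same computation with its stored copy of the secret and compares the result with the submitted response.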


If authentication was successful, the POST re- 
quest returns the string OK (with no white space 
preceding or following the string). If authentication 
fails, the server will return one of the following er- 
ror messages: 


• INVALID_RESPONSE[:attemptsLeft] - the response
provided to the challenge was invalid;
this is interpreted by both the App as well as
the server as an incorrect PIN entry. The optional
integer value attemptsLeft indicates the
number of tries left to return a correct response
(and enter the correct PIN)

• INVALID_USERID - the server does not know
the specified user

• INVALID_CHALLENGE - there is no known
challenge for the current session; this usually
indicates that the challenge has become invalid
because of a timeout

• ACCOUNT_BLOCKED[:seconds] - indicates that
the response provided to the challenge was
invalid and that the associated user account is
now blocked on the server; optionally, the account
may be temporarily blocked for a specified
number of seconds (this feature will be
implemented as of version 1.2 of the tiqr App)

• INVALID_REQUEST - the POST request contained
incorrect parameter data and was not
accepted by the server

• ERROR - an unspecified error occurred
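
The response strings listed above lend themselves to a small parser on the client side. The sketch below is illustrative; the names are assumptions, and rejecting unknown codes with an exception is a design choice rather than something the paper prescribes.

```python
from typing import NamedTuple, Optional

PLAIN_CODES = {"OK", "INVALID_USERID", "INVALID_CHALLENGE",
               "INVALID_REQUEST", "ERROR"}
CODES_WITH_VALUE = {"INVALID_RESPONSE", "ACCOUNT_BLOCKED"}


class AuthResult(NamedTuple):
    code: str
    value: Optional[int]   # attemptsLeft or seconds, when supplied


def parse_auth_response(body: str) -> AuthResult:
    """Interpret the body returned by the authentication endpoint."""
    code, separator, argument = body.partition(":")
    if code in PLAIN_CODES and not separator:
        return AuthResult(code, None)
    if code in CODES_WITH_VALUE:
        # The integer suffix is optional for both codes.
        return AuthResult(code, int(argument) if separator else None)
    raise ValueError(f"unrecognised response: {body!r}")
```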


5.6 INTEGRATION WITH APPLICATIONS 


As mentioned in sections 5.4.2 and 5.4.3, we
already provide several options for integrating
tiqr into existing applications. The reference
implementations we provide can serve as a basis for
integration into web applications, but tiqr can also
be used in other contexts.


Shortly after the first release of tiqr an independent
software vendor, RCDevs from France, integrated
support for tiqr into their OpenOTP Authentication
Server [36]. Based on their existing
integration with several products they were able to
show that tiqr can - for instance - be used as an
authentication method to log users in to a secure
shell (SSH) session (see Figure 11 for an example).


Figure 11 - using tiqr to authenticate an SSH session*


*Screenshot courtesy of Charly Rohart from RCDevs 


Another example of integration into third-party
frameworks is the open request to integrate tiqr
support into Shibboleth [37], a much-used framework
for federated identity management.


5.7 SECURITY AUDIT 


We hired an independent security auditor - 
Eindhoven-based Madison Gurkha, see 
http://www.madisongurkha.nl/ - to assess the se- 
curity of tiqr. The goals we set them were to: 


• Assess the architecture and design of tiqr from
a security perspective

• Perform a code audit of both the App for iOS as
well as for Android

• Perform a code audit of the reference server-side
implementation

• Perform security tests on the live solution
(both server as well as client side)


The security audit was performed on version
1.0 of the App (as released in April 2011 for iOS and
May 2011 for Android) and was finished in June.
The outcome of the security audit was positive;
although the auditors identified several issues that
needed resolving, they did not find any flaws in the
architecture of the solution (note: the audit report
will be published on the tiqr website,
https://tiqr.org/, in the autumn of 2011). The most
important remark from the auditors was that -
strictly speaking - tiqr does not offer full two-factor
authentication since the smart phone platform is
much more accessible to evildoers than, say, a
purpose-built hardware OTP token. We agree with this
but would like to add that tiqr nevertheless is a vast
improvement security-wise over username/password.
And relying on a smart phone also has distinct
advantages; recent research has shown that users are
likely to notice that their phone is missing fairly
quickly (see [44]). We think that this is much less
likely to be the case for e.g. OTP tokens, since the
single-purpose nature of these devices means they are
used much less frequently.


We have taken care to include several security
measures to deal with the inherently untrustworthy
nature of smart phone platforms (see 5.3.2). The
auditors agree that these measures indeed signifi- 
cantly enhance the security and they also agree that 
tiqr is an attractive and more secure alternative to 
username/password. They caution though that it is 
not fully equivalent to the security a hardware OTP 
token can offer. We would like to note that this is 
also true for the more traditional OTP Apps that 
OTP token vendors have started offering (see 2.4.2). 


Another remark that the auditors made is that 
tiqr is potentially vulnerable to phishing. Attackers 
could perform a man-in-the-middle attack by initiat- 
ing an authentication session, thus retrieving the QR 
code containing the challenge and by displaying this
on a fake site, tricking the user into logging in but
instead giving the attacker access to their account.
We agree that this is a risk and as mitigation the tiqr 
App always displays the fully qualified domain 
name of the site that the user is being authenticated 
to. Users are expected to validate the authenticity of 
the site they are logging into in the same way they 
do for e.g. their banking site, i.e. by checking the 
site’s URL and server certificate. Note that this prob- 
lem is not unique to tiqr; all OTP solutions are 
equally vulnerable to phishing. 


The issues in the code and design identified by
the auditors were resolved in version 1.1 of the App
and in the reference server-side implementation
that is available from the tiqr website. We also
contributed fixes for the vulnerabilities in the OCRA
reference implementation back to the authors of the
RFC.


5.8 AVAILABILITY 


From the outset it has been our goal to make
tiqr freely available to all Internet users. To achieve
this goal, we have released all relevant software as
open source under a BSD-style licence and we have
made the tiqr Apps available for free in both the App
Store and the Android Market.


All source code and documentation as well as a
demo server can be found on our website,
https://tiqr.org/


5.9 ROADMAP AND FUTURE WORK


Now that a production-ready version is available
our next step will be to deploy tiqr within our
own organisation. SURFnet currently uses X.509
software certificates for authentication to certain
services. We plan to replace these with tiqr. All
SURFnet employees have either an iOS- or Android-based
smart phone, making them an ideal user population
in which to use tiqr.


When we have gained real-world experience 
we will evaluate this deployment. Then, we plan to 
gradually introduce tiqr as an alternative means of 
authentication (alternative to username/password 
and/or SMS authentication) in some of the services 
we offer to our constituency.


We are also talking to our connected institutions
to set up a pilot with a larger population. Our
goal is to provide tiqr as an alternative means of
authentication on identity providers in our federation
that currently only offer username/password.


One of the things we will be evaluating in these
pilot deployments is whether or not the paradigm
of encouraging users to use the same PIN for all 
their tiqr accounts works and whether or not it 
makes sense to block all accounts if one account 
needs to be blocked because of too many failed at- 
tempts at entering the correct PIN. 


From a technological perspective, we are con- 
sidering pursuing several areas of research: 


• Turning tiqr into a true hardware token by
leveraging the possibilities offered by SD cards
with an embedded smart card controller
(smartSD cards, see [39])

• Incorporating attribute release into tiqr
(where tiqr releases attributes about a user
asserted by a trusted third party), similar to
the InfoCards paradigm (see [38])

• Using advances in cryptography such as
zero-knowledge proof to further enhance the
privacy aspects of tiqr.

• Using tiqr for transaction signing (as is e.g.
done by banks with OTP tokens to approve
financial transactions).


Some of these we will probably do ourselves, 
others we hope to pursue together with the academ- 
ic community in the areas of cryptography and digi- 
tal security. 


5.10 REFLECTION 


When we started the tiqr project we set out to
create a two-factor authentication solution that 
would leverage the benefits of using a device that 
(almost) everybody has: a mobile phone. We were 
also mindful that the solution should be an im- 
provement over username/password given the cri- 
teria introduced in section 4 and if possible it should 
also offer advantages over more traditional solu- 
tions. 


We feel that with tiqr we have achieved most
of these goals. We believe tiqr to be very user-friendly
(much more so than some other two-factor
authentication solutions) and also believe that tiqr
is an improvement in terms of security over
username/password and on a par in that respect
with many other two-factor authentication solutions.
Furthermore, we believe that tiqr can be a
viable replacement for more traditional OTP solutions,
especially the OTP Apps and SMS authentication.


We feel that we should point out, though, that 
tiqr is not a panacea that solves all problems in 
(two-factor) authentication. It is, for instance, just as 
vulnerable to phishing as traditional OTP solutions. 
And because it does not rely on purpose-built hard- 
ware to store the secret data associated with a us- 
er’s identity its security is not as strong as tradition- 
al tokens. 


Nevertheless, we feel that tiqr is a useful addi- 
tion to the two-factor authentication landscape. Its 
user-friendliness and the control over deployment it 
gives to organisations are also strong points. 


6 CLASSIFICATION REVISITED 
6.1 INTRODUCTION 


In section 4 we introduced a classification for
authentication solutions, judging solutions on
hardware (in-)dependence, software (in-)dependence,
security, cost, open standards compliance
and ease-of-use. Now that we have introduced
tiqr, we will revisit this classification.


6.2 CLASSIFICATION OF TIQR 


Table 1 showed a classification of authentication
solutions according to the criteria introduced in
section 4. In Table 2 we have reprinted this
classification and added tiqr at the bottom of the table
(marked in grey).


[Table 2: classification of authentication solutions with tiqr
added as the last row. Columns: Hardware indep., Software indep.,
Security, Cost, Open standards, Ease-of-use.]


Table 2 - Classification including tiqr 


Again, one could argue that any classification 
is subjective, especially since we are judging our 
own solution. Therefore, we have tried to justify the 
classification we have assigned to tiqr below: 


• Hardware (in-)dependence - tiqr requires advanced
features only available on smart
phones; it is therefore not as hardware independent
as some of the other solutions that rely
on mobile phones

• Software (in-)dependence - tiqr is currently
only available for two smart phone platforms

• Security - although not as secure as a dedicated
token, tiqr is much more secure than
username/password and on a par with other
OTP Apps

• Cost - tiqr is open source and available for
free; the only inhibiting factor may be the cost
of the device required to run tiqr

• Open standards - tiqr was built from the
ground up to include open standards and the
tiqr protocol itself has also been published

• Ease-of-use - tiqr was designed to be user
friendly from the ground up by skilled interface
designers


7 CONCLUSIONS AND RECOMMENDATIONS


7.1 THE NEED FOR TWO-FACTOR AUTHENTICATION


Our lives are increasingly being lived in the 
digital world. Social networks have become the sta- 
ple of a new generation and many professionals 
cannot live without e-mail, VoIP, and services like 
LinkedIn. Governments and the public sector are 
also increasingly making vast amounts of often per- 
sonal data (like medical records) available online. 


This means that the digital identities we use
to access these services are becoming ever more
valuable. It is no longer just your credit card that
is at risk of being stolen; whole identities
get hijacked.


We feel that it is inevitable that two-factor au- 
thentication becomes more widespread and actively 
try to stimulate its adoption, both within our own 
community as well as on a wider scale. 


7.2 OPEN STANDARDS AND OPEN SOURCE 


The best way forward to bring two-factor au- 
thentication to a wider audience is the adoption of 
open standards by vendors. This applies foremost to 
vendors of hardware OTP and PKI tokens. It is also 
of paramount importance that integration of two- 
factor authentication in online services becomes 
easier. One way of achieving this is by releasing 
open source solutions with flexible licenses. These 
can serve as useful examples and facilitate rapid 
integration. 


7.3 TIQR 


We have strived to practice what we preach 
when creating tiqr. We have focused on creating an 
open standards-based, open source and easy-to-use 
solution that is freely available. We believe that we 
have succeeded in the goals we set ourselves in that 
respect and we hope that tiqr can serve as a starting
point for many organisations who want to integrate 
two-factor authentication into their online services. 


It is also worth mentioning that
parts of the tiqr project have now been spun off as
separate open source projects because of their general
applicability. These include TokenExchange (an
abstraction for push notifications supporting several
device platforms), see [42], and a set of OCRA
reference implementations in several programming
languages, see [43].


7.4 RECOMMENDATIONS 


We have already outlined recommendations
for future work on tiqr in section 5.9. In addition to
that, we recommend that readers of this paper
invest some time in considering what two-factor
authentication could add in terms of security
both within their own organisations and for
users of their online services.


Introduction


Working as a Security Engineer for a research program in the Federal government is a 
lot of fun, but incredibly challenging. Research, rightfully, receives the lion’s share of 
funding, leaving very little for support services like IT and no funding for security-specific
activities. However, the burden of designing, implementing, analyzing, and reporting 
compliance to weighty government IT Security mandates like FISMA falls squarely on 
the IT section. 


Our IT staff is less than 10 people. We provide Help Desk, Linux server administration, 
networking (switches, IDS, firewalls, NMS), SQL Databases, SMB file shares, 
programming support, training, and implement in-house applications for scientific 
research mostly in Perl and PHP for our institute of 700-900 users. We are also 
responsible for reporting compliance with Federal, Institutional, and Divisional mandates 
to our oversight. 


In order to achieve all of this with a small staff, we’ve designed and implemented a lot of 
automation based on Open Source Software. We’ve learned how to leverage these 
tools to meet the needs of our institute and the requirements of those above us. 


As a pragmatic group with very little free time, we focus on building security tools that 
provide daily operational value. We simply do not have the resources to implement 
controls for the sake of the controls themselves. 


The More You Know 


Most IT Security controls focus on first understanding your systems. A system in this
sense is defined as the computers, people, and networks that work together to perform
a task. In order to begin classifying, we need to know what we have and where. The
first step was to roll out a comprehensive centralized logging infrastructure for our UNIX
and Windows servers.

We chose to use syslog-ng as a basis for our centralized logging platform for one
incredibly useful feature:


destination d_subscriptions {
    program("/usr/local/eris/bin/syslog-ng-service.pl");
};


This feature starts the specified program and keeps a handle to that program’s STDIN
open, dispatching messages to it based on the “log {}” definitions specified in the
configuration file. This removes the startup overhead from the called program, allowing
the use of programs written in dynamic scripting languages, which incur enormous
startup penalties. It also ensures that the program end point is available while syslog-ng
is running, meaning there’s no additional program supervision necessary.


In order to facilitate rapid development of syslog based event correlators, we developed 
this program to convert the incoming syslog stream into a TCP based service that a 
script can connect to and “subscribe” to feeds of interest. For instance, the inventory 
application subscribes to dhcpd, MSWinEventLog, sshd, arpwatch, and smbd. This 
keeps the database size smaller and focussed on the events that prove most useful to 
operations. 
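
The subscription idea can be sketched in a few lines. The example below is written in Python rather than the Perl used for the actual service, delivers messages to in-process callbacks instead of TCP clients, and assumes the classic "timestamp host program[pid]: message" line layout; the real layout depends on the syslog-ng template in use. All names are illustrative.

```python
import re
import sys
from collections import defaultdict

# Assumed layout: "Jun  2 12:10:55 host program[pid]: message"
LINE_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\s"
    r"(?P<program>[^\s:\[]+)(?:\[(?P<pid>\d+)\])?:\s?(?P<message>.*)$"
)


class Dispatcher:
    """Route parsed syslog lines to callbacks subscribed by program name."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, program, callback):
        self._subscribers[program].append(callback)

    def dispatch(self, line):
        """Deliver one line; return True if any subscriber received it."""
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            return False
        callbacks = self._subscribers.get(match["program"], [])
        for callback in callbacks:
            callback(match.groupdict())
        return bool(callbacks)


def run(dispatcher, stream=sys.stdin):
    """Consume lines for as long as syslog-ng keeps the pipe open."""
    for line in stream:
        dispatcher.dispatch(line)
```

An inventory script would subscribe to programs such as dhcpd or sshd and simply never see the remaining traffic.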


A Safe Place to Keep Our Data 


In order to facilitate strange and novel concepts in correlation, it was clear that a 
Relational Database would be awesome as a storage engine to allow indexing and 
searching of the data we received. A PostgreSQL database server was set up and
configured to allow data storage and retrieval. The reasons for this are numerous, but 
at the time of initial development it was the only Open Source database with views, 
stored procedures, triggers, and a slew of particularly relevant native data types 
including network types for IP addresses, networks, and MAC addresses. PostgreSQL 
has continued to make dramatic improvements to performance and usability since that 
time, and continues to be a leading Open Source RDBMS with unparalleled features 
and reliability. 


PostgreSQL’s PL/pgSQL language extension, which was designed to be as close as possible to
Oracle’s PL/SQL, provides the option to do data correlation and validation in the
database through the use of stored procedures and triggers. This facilitates rapid
development of scripts placing data in the database as the “business logic” can be
implemented at the data storage level. There is a performance penalty for doing this,
but it allows for correlation to occur automatically as new datasources are added.


The setup we've designed is represented via the diagram on the following page.


[Figure: Conceptual Overview of the System]


Connecting the Dots


DHCP logs serve as a primary jumping off point for data correlation. We can simply 
store MAC, IP, and hostname attributes in a table and use them for lookups. We chose 
Netdisco as an Open Source layer 2 network management system that could be 
deployed to PostgreSQL. With a few triggers added to the Netdisco system, we can 
correlate MAC addresses to switch and port, which allows our staff to quickly establish
building, floor, and wing for any IP address on our network. 


Using Samba and MS Windows Event Logs, we were able to discover the 
ActiveDirectory account name logged in to any client system. This allows simple IP to 
username correlation for things like IDS, but more importantly username to IP matching 
so our Help Desk doesn’t have to walk every user through the “Start -> Run -> cmd ->
ipconfig” routine, saving a few minutes with each call to the Help Desk. 


Years prior to this system, our staff was asked to supplement the existing Enterprise
Directory Service, which was developed for all of NIH, with a number of
enhancements specific to our Intramural Research Program. One of the features
required every user to be assigned a “Supervisor” attribute that links to another person
object. Each person object contains full name, email, AD account, phone number,
building, room, lab, and a pointer to the supervisor person object. This data was
imported and synced to the PostgreSQL database as it would prove useful.


Basic Inventory Information


So with a carefully designed PG database and a few hundred lines of Perl code, we’ve 
managed to establish the following relationships: 


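dhcp:                  MAC, IP, Hostname
netdisco:              MAC, Switch, Port
smb / Event Logs:      IP, Username
Employee Database:     Username, Full Name, Phone, Physical Location,
                       Supervisor Pointer, Group Membership
Building Information:  Switch, Building, Floor and Wing

Links: dhcp MAC -> netdisco MAC; dhcp IP -> smb / Event Logs IP;
smb / Event Logs Username -> Employee Database Username;
netdisco Switch -> Building Information Switch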


What this means is:


* any event on the network containing a MAC address, IP address, or username can
be correlated to any of the others
* and can also be correlated back to the metadata on the username and location

This allows classification of events in terms of business structure or geographic location 
all from data already available. 
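
As a rough illustration of these relationships, the sketch below joins five simplified tables to answer "who is using this IP address, and where are they?". It uses SQLite from the Python standard library as a stand-in for PostgreSQL, and every table name, column name, and sample row is hypothetical rather than taken from the real schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE dhcp      (mac TEXT, ip TEXT, hostname TEXT);
CREATE TABLE netdisco  (mac TEXT, switch TEXT, port TEXT);
CREATE TABLE logins    (ip TEXT, username TEXT);
CREATE TABLE employees (username TEXT, full_name TEXT, phone TEXT);
CREATE TABLE buildings (switch TEXT, building TEXT, floor_wing TEXT);
"""


def who_and_where(conn, ip):
    """Follow IP -> MAC -> switch/port -> building and IP -> user."""
    return conn.execute(
        """
        SELECT l.username, e.full_name, n.switch, n.port,
               b.building, b.floor_wing
        FROM dhcp d
        JOIN netdisco n  ON n.mac = d.mac
        JOIN logins l    ON l.ip = d.ip
        JOIN employees e ON e.username = l.username
        JOIN buildings b ON b.switch = n.switch
        WHERE d.ip = ?
        """,
        (ip,),
    ).fetchone()


def demo_connection():
    """Build an in-memory database with one made-up record per table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO dhcp VALUES (?, ?, ?)",
                 ("00:11:22:33:44:55", "192.0.2.10", "lab-pc-01"))
    conn.execute("INSERT INTO netdisco VALUES (?, ?, ?)",
                 ("00:11:22:33:44:55", "sw-b10-2", "Gi1/0/7"))
    conn.execute("INSERT INTO logins VALUES (?, ?)", ("192.0.2.10", "alice"))
    conn.execute("INSERT INTO employees VALUES (?, ?, ?)",
                 ("alice", "Alice Example", "555-0100"))
    conn.execute("INSERT INTO buildings VALUES (?, ?, ?)",
                 ("sw-b10-2", "Building 10", "2 East"))
    return conn
```

In the production system this kind of lookup is driven by triggers and views inside the database; the explicit join here only makes the chain of keys visible.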


The tables which implement the storage keep track of the dates and times at which inventory
events such as DHCP, login, and ARP discovery occur. Since we’re siphoning this data from
the network servers and switches, we’re not relying on a high-level or complicated
protocol like IBM Tivoli End Point Manager to discover the devices. When a device is
plugged into the network, we are able to immediately see it’s been connected and
record the date, time, and location of the event.


Now, when we need to report on the number of Apple computers, we can loop through 
the MAC Addresses seen in the past 6 months, look up the manufacturer data from the 
OUI Database, and report more accurately the number of active Apple Computers on 
our network. 
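
The Apple report described above amounts to a de-duplicated count of recently seen MAC addresses grouped by OUI prefix. A minimal sketch follows; the function names are assumptions, the six-month window is approximated as 182 days, and the vendor table passed in is expected to come from the OUI database.

```python
import re
from collections import Counter
from datetime import datetime, timedelta


def normalise_mac(mac: str) -> str:
    """Reduce a MAC address to 12 upper-case hex digits."""
    digits = re.sub(r"[^0-9A-Fa-f]", "", mac).upper()
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits


def count_by_vendor(sightings, oui_table, now,
                    window=timedelta(days=182)):
    """Count distinct MACs seen within the window, grouped by vendor.

    sightings: iterable of (mac, last_seen datetime) pairs
    oui_table: dict mapping a 6-hex-digit OUI prefix to a vendor name
    """
    recent = {normalise_mac(mac) for mac, last_seen in sightings
              if now - last_seen <= window}
    return Counter(oui_table.get(mac[:6], "Unknown") for mac in recent)
```

Normalising before de-duplicating matters because different log sources tend to format the same address differently.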


Creating Useful Security


By wrapping this in a searchable web application, the Help Desk staff can now be far 
more efficient on each call. 


[Screenshot: “User Details” search result showing status, full name, email, lab,
AD last logon, and roles, with “Authentication History” and “User's Devices” tables]


This is excellent, but the real benefits of this system began to show up as we integrated
it with more Open Source security products.


Intrusion Detection and Correlation 


We chose Snort as our Open Source IDS. Snort is free, fast, and stable. It does take a 
considerable amount of setup and configuration, but any signature based IDS solution 
will require that overhead. After being configured to listen through a network tap to a 
bonded interface on a CentOS box, we end up getting alerts that look like this: 


Jun 2 12:10:55 myids snort[12908]: [1:2012647:2] ET POLICY Dropbox.com
Offsite File Backup in Use [Classification: Potential Corporate Privacy
Violation] [Priority: 1] {TCP} 137.x.x.x:1211 -> 199.47.216.144:80


This output is familiar to security professionals. We do have an IP address, which we
have already established can be attributed back to a username, and a username back
to their organizational unit. This alert is parsed by the system and the relevant data
stripped off and classified. An alert like this is classified in our system as “Potential Data
Loss Event.”
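
Parsing an alert of this shape can be sketched with a single regular expression. The example is in Python rather than the Perl used in production, the field names are illustrative, and keying the local event class on Snort's classification text is an assumption; only the one mapping mentioned above is included.

```python
import re

ALERT_RE = re.compile(
    r"snort\[\d+\]:\s+\[(?P<gid>\d+):(?P<sid>\d+):(?P<rev>\d+)\]\s+"
    r"(?P<msg>.*?)\s+\[Classification:\s*(?P<classification>[^\]]+)\]\s+"
    r"\[Priority:\s*(?P<priority>\d+)\]\s+\{(?P<proto>\w+)\}\s+"
    r"(?P<src_ip>[\w.]+):(?P<src_port>\d+)\s+->\s+"
    r"(?P<dst_ip>[\w.]+):(?P<dst_port>\d+)"
)

# Hypothetical mapping from Snort classification to a local event class.
EVENT_CLASSES = {
    "Potential Corporate Privacy Violation": "Potential Data Loss Event",
}


def parse_snort_alert(line: str):
    """Return the alert fields as a dict, or None if the line does not match."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    alert = match.groupdict()
    for key in ("gid", "sid", "rev", "priority", "src_port", "dst_port"):
        alert[key] = int(alert[key])
    alert["event_class"] = EVENT_CLASSES.get(alert["classification"],
                                             "Unclassified")
    return alert
```

The source IP from the parsed alert is then the key for the username and organizational unit lookups described earlier.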


We can then generate reports based on organizational unit and event classification 
which can tell us interesting things about a lab or section that we may not really want to 
know! 


Configuration Management or DevOps for Compliance 
All you base are belong to .. 


Puppet, or any other configuration management engine that suits your needs, will be
sufficient. Spend some time evaluating the various configuration management engines
and choose the one that best suits your organization. I cannot say enough good things
about a good CM system that fits your organization.


We deployed Puppet and then put everything into our Version Control System (VCS)
with commit hooks to automatically deploy new tagged releases to the PuppetMaster.
Using Puppet’s Domain Specific Language (DSL), I was able to convince our small staff
of old school developers to embrace VCS after I built and demonstrated this:


subversion::deploy { 'project_name':
    svnurl => 'svn+ssh://svn-readonly/repos/section/projectname',
    target => '/opt/local/project_name',
    notify => Service['httpd']
}


This requires only a Subversion project directory with a trunk/ and tags/ subdirectory.
A bash script for tagging releases is distributed to /usr/local/bin/svntag, which makes
creating incremental release tags as easy as typing “svntag.” Puppet will then use
$target/RELEASE to maintain the release number that’s been deployed to that target,
and anytime a new release is tagged, it will be automatically deployed at that location.
Additions for allowing hostname-based configuration files were incorporated into a macro
based off this:


webapp::deploy { 'name': project => 'name', config => 'name.yml' }


This expands to doing much the same as the previous example including the httpd 
restart, but also deploys the application configuration after the checkout completes, 
overwriting the development configurations that are stored in the subversion repository. 


We run RedHat-based distributions (CentOS, Fedora, and now Scientific Linux). Puppet
was a great start, but Cobbler has solidified our CM platform. Cobbler is a KickStart-based
systems build platform that utilizes PXE to automate installs of RedHat-based
distributions. There is some work in progress to extend its functionality to Debian
systems. One reason to choose a build system like Cobbler, instead of an imaging
solution like Ghost, is deployment to the hodgepodge of hand-me-down hardware that we
maintain in a Research program.


When Cobbler performs an install, it’s configured to include Puppet in the build, setting it
to run at first boot. Using Puppet and Cobbler to rebuild my IDS sensor when I had to
replace the hard drive took 37 minutes from PXE boot to up and running with Snort and
syslog-ng for centralized logging.


And what exactly does this have to do with Security? 


A lot. Using a configuration management suite provides countless security benefits. 
First, it is the ultimate tool for guaranteeing consistent configuration across your 
network. It also offers the most benefit when each system is configured as much as 
possible by the CM. This allows a system administrator to PXE boot a new piece of 
hardware to replace an existing server and have the box configured identically in under 
an hour. 


What we also get is a free inventory of all our servers. Since, as logical people, we tend 
to name classes and definitions something meaningful, we can leverage the CM tool to 
report on system functions and logical groupings. Puppet stores its catalogs and states
in simple YAML files which are parsed quickly and efficiently by your language of 
choice. 


Configuration management, system inventories, and software inventories are all provided.
It’s even possible to view the state of compliance with the catalogs using the Puppet
Dashboard. There are happy green and stressed-out red lights for your auditors to
admire!


Sample Puppet Dashboard

[Figure: screenshot of the Puppet Dashboard Node Manager for a single node, showing the daily run status chart (successful, failed, unreported), run time in ms by class, and the list of recent reports.]


Extending the Functionality


After collecting and storing all this data in a relational database, we’ve found it
indispensable for solving problems on a day-to-day basis, from simple one-off scripts that
determine the number of Apple computers on the network to more extensive systems
that were trivial to implement on the back of the data we’ve collected. Consider our
researchers’ requirement to use Skype to collaborate with international colleagues at
no cost. The Federal Government prohibits the use of Skype unless there are
adequate compensating controls in place.


Using the data we’ve collected, we developed automated tracking of Skype users.
To receive a waiver from the Departmental Policy, we were required to have Skype users at
our Institute affirm a monthly “Rules of Behavior” (RoB) update. The process of
discovery and tracking of Skype usage looks something like this:


1. IDS signatures classified as “Skype” are correlated to usernames using the
database.
2. The usernames are checked against a table of accepted RoBs.
a. If this is the first event, i.e., there are no rows in the RoB table, the user is emailed the RoB
and must click a link, sign in, and agree to the terms.
b. Otherwise, the “last detected Skype usage” timestamp is updated.
3. If, at the time of detection, it has been a month or more since the acceptance of the
RoB, the user is again emailed the link and asked to sign in and agree to the terms of
the RoB.
4. Every day, a list of users required to accept the RoB is compiled and emailed to
the administrators with their status included.
a. The administrators receive the phone number, building/room information, and
the lab manager details for each user, aiding in persuading the user to accept
the RoB.
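
The decision logic in steps 2 through 4 can be sketched in a few lines. This is only an illustration under stated assumptions: the in-memory dict stands in for the RoB table in the database, the names handle_skype_event and pending_users are made up for the example, “a month” is approximated as 30 days, and marking an expired acceptance as pending is a simplification.

```python
from datetime import datetime, timedelta

ROB_VALIDITY = timedelta(days=30)  # assumption: "a month" treated as 30 days


def handle_skype_event(rob_table, username, detected_at, send_rob_email):
    """Apply steps 2 and 3 to one Skype detection already tied to a username.

    rob_table maps username -> {"accepted_at": datetime or None,
                                "last_detected": datetime}.
    Returns True if the user was emailed the RoB link.
    """
    row = rob_table.get(username)
    if row is None:
        # Step 2a: first event for this user, so create a pending row and email.
        rob_table[username] = {"accepted_at": None, "last_detected": detected_at}
        send_rob_email(username)
        return True
    # Step 2b: known user, so record the latest detection.
    row["last_detected"] = detected_at
    accepted = row["accepted_at"]
    # Step 3: acceptance is a month old or more, so ask again.
    if accepted is not None and detected_at - accepted >= ROB_VALIDITY:
        row["accepted_at"] = None  # simplification: expired acceptance is pending
        send_rob_email(username)
        return True
    return False


def pending_users(rob_table):
    """Step 4: usernames that still need to accept the RoB."""
    return sorted(u for u, row in rob_table.items() if row["accepted_at"] is None)
```

A user who has been emailed but has not yet accepted is not emailed again on every detection; they simply stay on the daily list until they accept.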


This system is mostly automated, except for the occasional phone calls to the users 
requesting that they agree to the RoB terms. Other potential solutions to this problem 
exist, but often require complex proxy configurations that break if the user takes their 
laptop offsite, or manual exceptions by statically assigned IP addresses and manual 
tracking of RoB Acceptance. Our solution saves time, energy, and resources. It is only 
minimally invasive to the end-users and barely noticeable to the administrators! 


Lessons Learned 


By choosing to develop this system in house, we have gained invaluable experience
and knowledge. Building the infrastructure to support the development of our
custom inventory and security correlation engine has led to near-mastery of Modern
Perl, PostgreSQL, centralized logging infrastructure, and VCS, both Subversion and Git.
The fact that we’re storing everything in a relational database provides us with the ability
to “mash up” data from disparate sources. We’ve been able to successfully respond to
data calls from our parent organization with SQL statements that we can reproduce time
and time again.


Sure, we didn’t get free t-shirts, calendars, and pens from vendors. We didn’t go to 
training sessions at fancy hotels to learn to use each piece of the system. We can’t hire 
someone with a specific vendor certification to replace a team member if they leave. 
However, the entire team has learned to work together and everyone has increased 
their abilities in many different areas. Our Help Desk staff know Windows, Linux, Mac
OS X, and some rudimentary programming. They also understand the network and
how it works. This type of training, which lacks the polish and formality offered by the
big-name vendors, is invaluable to day-to-day operations.


But the best part of the system is that it’s being used, every day, by Ops team members to
make a difference.


Resources


Open Source Software Mentioned


- syslog-ng: Replacement for standard syslog
  - http://www.balabit.com/network-security/syslog-ng
- PostgreSQL: Open Source Relational Database
  - http://postgresql.org
- Netdisco: Open Source Network Management System
  - http://netdisco.org
- Perl Components (http://perl.com)
  - Catalyst: MVC Framework
    - http://catalyst.perl.org
  - POE: Event driven Perl library
    - http://poe.perl.org
- Snort: Open Source Intrusion Detection System
  - http://snort.org
- Puppet: Open Source Configuration Management Engine
  - http://www.puppetlabs.com
- Subversion: Open Source Centralized Version Control System
  - http://subversion.tigris.org/
- Cobbler: Open Source Installation Server
  - https://fedorahosted.org/cobbler/


Projects by Brad Lhotsky (https://github.com/reyjrar)


- svnutils: A collection of Subversion utilities for automatic deployment and integration
  with Puppet
  - https://github.com/reyjrar/svnutils
- optperl: Spec files for installing Perl into /opt including integration with Puppet
  - https://github.com/reyjrar/optperl
- POE::Component::Client::eris: Perl module for subscription based log tailing
  - https://github.com/reyjrar/POE-Component-Client-eris
- eris: Network Console which collects and correlates data
  - https://github.com/reyjrar/eris
ABSTRACT


In this paper we describe a method for near real-time
identification of attack behavior and local security policy
violations taking place over SSH. A rationale is provided for the
placement of instrumentation points within SSHD based on the
analysis of data flow within the OpenSSH application as well as
our overall architectural design and design principles. Sample
attack and performance analysis examples are also provided.


Categories and Subject Descriptors 
D.4.6 [Security and Protection]: Information Flow Controls — 


General Terms 
Measurement, Security. 


Keywords 
SSH, keystroke logging, Bro IDS, Intrusion Detection, policy 
enforcement. 


1. INTRODUCTION 


The adoption of SSH as the de facto protocol for interactive shell
access has proven to be extremely successful in terms of avoiding
shared media credential theft and man-in-the-middle attacks. At
the same time it has also created difficulty for attack detection and
forensic analysis for the computer security community. The SSH
protocol and its implementations such as OpenSSH [9] provide
tremendous power and flexibility. Examples of this flexibility 
include authentication and encryption options, shell access, 
remote application execution and X11 and SOCKS forwarding. 
While the benefits gained vastly exceed the difficulties introduced 
by this protocol, the loss of visibility into user activity created 
problems for the security groups tasked with monitoring network 
based logins and activity. 


The National Energy Research Scientific Computing Center 
(NERSC) is the primary open science computing facility for the 
Office of Science in the U.S. Department of Energy. It is one of 
the largest facilities in the world devoted to providing
computational resources and expertise for basic scientific 
research, and has on the average 4000 users across seven primary 
computational platforms. The significant majority of user 
interaction involves interactive ssh logins. To address this lack of 
visibility into user activity on our high performance computing 
(HPC) infrastructure, we introduced an instrumentation layer into 
the OpenSSH application and feed the output into a real time 
analyzer based on the Bro IDS. This instrumentation provides 
application data such as user keystrokes and login details, as well 
as metadata from the SSHD such as session and channel creation 
details. This data is fed to an analyzer where local site security 
policy is applied to it, allowing decisions to be made regarding 
hostile activity. The data analyzer is based on the Bro intrusion 
detection system (IDS) [10], which provides a native scripting
language to handle data structures, tables, and timers to express local
security policy. In this capacity Bro is being used as a flexible
data interpreter. A key differentiator between the instrumented
SSHD (iSSHD) and many other security tools and research
projects is that iSSHD is not designed to detect and act on single
anomalous events (like unexpected command sequences), but 
rather to enforce local security policy on data provided by the 
running SSHD instances. 


A key idea is that the generation of data is completely decoupled 
from its analysis. The iSSHD instance generates data and the 
analyzer applies local policy to it. By using the Broccoli library 
[4], we convert the structured text data output by iSSHD into 
native Bro events that are processed by the analyzer system [3].
Events, as their name implies, are single actions or decisions made 
by a user that are agnostic from a security analysis perspective. 
Bro processes these events in the same way as network traffic 
events, applying local security policy to interpret them as desired. 


Local security policy can be thought of as sets of heuristics that 
describe (in this context) what behaviors are considered 
unacceptable or suspect. This behavior might be a command like 
“mkdir ...”, application usage like remotely executing a login 
shell, or tunneling traffic to avoid blocked ports. iSSHD was
designed so that the installed SSHD instance would not need to be 
modified with every new threat. Instead, changes are made on the 
analysis/policy side as new problems are identified. This not only 
simplifies administration, but also allows experiments to be run on 
previous logs without significant work. 
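
To make the separation concrete, the following is a minimal sketch of policy applied to a stream of events. It is written in Python rather than the Bro scripting language actually used, and the event names, record fields, and policy functions are hypothetical. The point is only that policies are plain functions added or changed on the analysis side, and that the same functions can be replayed over stored events.

```python
# Hypothetical event records; real iSSHD event names and fields may differ.
def suspicious_command(event):
    """Flag the 'mkdir ...' style command mentioned in the text."""
    if event["name"] == "line_typed" and event["data"].startswith("mkdir ..."):
        return "suspicious directory name"
    return None


def tunnel_requested(event):
    """Flag any port forwarding request (assumed to violate local policy)."""
    if event["name"] == "port_forward_request":
        return "tunnel request to %s" % event["data"]
    return None


POLICIES = [suspicious_command, tunnel_requested]


def analyze(events, policies=POLICIES):
    """Apply every policy to every event and collect (user, notice) pairs."""
    notices = []
    for event in events:
        for policy in policies:
            notice = policy(event)
            if notice is not None:
                notices.append((event["user"], notice))
    return notices


# Stored events can be re-analyzed whenever a policy is added or changed.
log = [
    {"name": "line_typed", "user": "u1", "data": "ls -l"},
    {"name": "line_typed", "user": "u2", "data": "mkdir ..."},
    {"name": "port_forward_request", "user": "u2", "data": "localhost:25"},
]
notices = analyze(log)
```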


While NERSC has no explicit legal or privacy issues with 
intercepting communications on local systems, we recognize the 
importance of an informed user and staff population. To help 
address this we chose a policy of complete transparency. Each 
major group at NERSC was allowed representation in the design 
process and code review. As well, the entire user community was 
alerted to the changes by making announcements at User Group 
meetings and email notices. The complete source code is
available to anyone interested and can be secured through the 
LBNL Technology Transfer Office. 


The iSSHD project has been used in production capacity at 
NERSC for nearly three years on approximately 350 hosts. There 
are around 4000 user accounts with an average of 52,000
logins per day on the collective set of multi-user systems. In
addition to the obvious security functionality, there are a number 
of other non-security purposes like debugging user problems or 
job analysis where having access to historical keystroke data has 
been quite beneficial in tracking down systems problems. 


The remainder of the paper is structured as follows. In related
work similar coding projects and tools are presented. Next the 
execution flow within an unmodified OpenSSH 5.8p1 instance is 
mapped out. This flow provides a way to determine the most 
effective points for instrumentation. In section four, the overall 
architecture and design goals are detailed including the integration 
of Bro into the process. Section five provides implementation
details describing the inherent tradeoffs between complete
monitoring and resource limitations. Section six has examples of 
attacks and some rudimentary analysis. Finally future work and 
references are provided. 


2. RELATED WORK 


Related work can be generalized into several groups. These are
research projects relating to SSH data access, hacker activities,
and more generalized detection of SSH credential theft
in the HPC environment.


The work most similar to our own involves the hacker 
community’s use of backdoored SSHD instances to steal 
authentication credentials. In principle there is little difference 
between this behavior and the functionality provided by the 
iSSHD except in terms of the breadth of data provided. Statically 
backdoored OpenSSH code has been around since at least 1999 
[14], and more recent versions are trivial to locate - see [15] for 
example. 


Besides directly replacing the existing SSHD binary, there are at 
least three additional ways to access session data. The first is via 
direct access to a user’s terminal devices by a privileged user. 
This can be achieved by one of dozens of small applications or as 
part of a larger kernel rootkit [18]. A more subtle approach is to 
interfere with kernel level behavior, thereby preventing a user 
space analysis of the terminals from giving away the access. 
Typically rather than just looking at terminal IO, input and output 
system calls are intercepted via a hidden kernel module. This 
information is transmitted to an analysis tool or recorded. There 
are innumerable examples of this approach within the rootkit 
community [11] as well as Honeypot implementations such as 
Sebek [13]. Finally, an attacker can interact with the running SSHD
process by injecting code into it [16] or by using process debugging
to “jump” from a stolen user account to a potentially
privileged session on another machine [17] [1]. These last two
cases are somewhat subtle in that no changes to the actual static 
(non-running) binary are made. 


There is a general class of SSH related security work focusing on 
user account theft via anomaly detection, both in terms of 
command sets as well as process accounting data. These include 
Yurcik [21] [22] and Joohan Lee et al. [5] who look for account 
compromises within the HPC domain via accounting and 
command analysis. Historically, there is a rich collection of 
research relating to account masquerading, with a nice write-up by 
Malek et al. [6]. This last class of ideas can be fed by or used 
with the iSSHD and incorporated into the site’s overall intrusion
detection design since they are orthogonal to the actual iSSHD. 


3. SSH Application and Protocol 


In order to identify the best places to place instrumentation within 
the SSH application, it is necessary to understand the code path 
taken by typical behavior as well as subtleties within the protocol. 


From a historical perspective there are two individual (and 
incompatible) versions of the SSH protocol available. Tatu 
Ylonen created version 1 in 1995 as a replacement for the then
ubiquitous telnet and rlogin protocols. OpenSSH emerged with 
the OpenBSD group taking up development after a number of 
organizational changes including the splitting of the Ylonen code 
base at one of its last open source implementations. The SecSH
IETF working group developed version 2, originally published in
1998 and in 2006 a revised version of the protocol was adopted as 
a standard in RFC 4250 (Protocol Assigned Numbers) [23], 4251 
(Protocol Architecture) [24], 4252 (Authentication Protocol) [25], 
4253 (Transport Layer Protocol) [26], 4254 (Connection Protocol) 
[27]. 


In terms of this analysis, all paths and descriptions assume the use
of the version 2 protocol, since version 1 has suffered a number of
pathological security defects [19] which reduce its use to older
and unusual cases. In the case of the actual code instrumentation,
this assumption is not made and both version 1 and 2 provide
nearly identical logging. Section 3.1 presents a general
overview of the relationship between RFC and OpenSSH structure.
Section 3.2 takes this high level design and fleshes it out,
providing a code path and rationale for instrumentation locations.


3.1 SSH Application and Protocol Layering 
For this initial description we avoid taking into consideration a 
number of details in order to focus on the overall flow of 
information and data. For a generic shell interaction a simplified 
diagram of the data flow might look something like Figure 1.
Figure 1: Application vs. Protocol design for typical SSHD session


Here Figure 1 is broken out into two columns: on the left there is
the protocol layering as defined in RFC 4250-4254. The right side
describes the application implementation of those layers. It is
worth noting that the layers do not map 1 to 1, in particular the
role of the session object within the application, which according
to RFC 4253 should be rolled into the transport layer. Here each
application layer is a functional layer within the application, with
the parent SSHD represented as the top block. After a
successful network connection is made, the process forks, and an 
authentication context A is created. This context is used for the 
lifetime of the login and is used to track a number of 
authentication based data values. 


During the next step Key Exchange occurs, where the actual
negotiation for a cipher, MAC and compression takes place. First
server authentication takes place via server/host key pairs. This
authentication is transparent to the user if they have visited that
SSHD server in the past. Assuming the server authentication is
successful, algorithm negotiation for cipher and MAC takes place.
Finally the short-lived session key is generated which is used to
provide symmetric encryption for the data stream. This key is
periodically re-negotiated after a given time or data volume
passes. Since this is a reasonably well studied and logged area of
the application, none of the exchange is recorded in the iSSHD
besides what the system logs already do. If a strong reason to log
the session crypto data were identified, there is no reason
why it could not be done.


Figure 2: Internal SSHD Data Flow


The Authentication Layer (unsurprisingly) provides the actual 
user authentication process. This process is extremely flexible 
with a number of options natively defined by the application as 
well as any generic PAM infrastructure. During the
authentication process more than one authentication type
can be examined, so multiple fail and postpone events can be
generated even for a successful login. Since we are less interested
in the details of the authentication process than the outcome, there
is little or no detailed logging from iSSHD except for the
success/failure declaration as well as the authentication type being
used. We apply the same rationale to the key exchange process
since in both cases relevant data can be preserved in regular 
system logs. 


If the authentication process proves successful, a Session Object is 
created. This will be the primary container for not only the 
authentication context, but tty, X11 and channel data as well. The 
Session layer code also controls the mechanics of user login such 
as the login process, remote command execution, pty allocation 
and X11 forwarding. 


The session object can create, use and destroy Channels. A 
channel can be thought of as a connection within the Session 
Object that has well defined semantics for data movement, 
windowing information, file descriptors and multiplexing 
capacity. Typically for a shell, you would allocate a single 
channel that holds the file descriptors for stdin, stdout and stderr. 
It is not unusual though to have many additional channels in use 
for X-windows, SOCKS forwarding and authentication agents. 
Data within a channel is not encrypted since it is contained within 
a session which already is. This is a critical point for monitoring 
which we will use to our advantage. 


3.2 Common Code Paths During Execution 

Now that the behavior of OpenSSH for a typical login has been 
described, we can more closely examine code paths for strategic 
places to insert instrumentation. Identifying those paths involved 
reading the source code as well as experimenting with sessions 
running in debug mode. Since the most common service for SSH 
to provide is remote shell login access, it was the initial target for
both analysis and instrumentation. The execution path for this is
identical to that shown in Figure 1, except for some additional 
details found in the session section. A location is considered a 
good candidate for auditing if (1) there exists a decision making 
branch where most or all connections traverse or (2) a final state is 
arrived at which contains security relevant information. 


Figure 2 provides a more detailed set of code paths for nearly any 
use of OpenSSH. Here every box represents a transition between 
user privilege or application function and ultimately represents an 
event sent to the iSSHD analyzer. The creation of the Session
Object (SO) begins on the left side and the path moves to the right
until the user’s objective is reached. In it, a number of common
paths immediately stand out. The horizontal split between
session and tunnel driven services is an obvious candidate for
instrumentation. As a reminder, the session code tends to be more
execution oriented, i.e., involved with the invocation of services,
commands and shells. Since it is not unusual for an attacker to 
use a known tool or service in a way which is unusual, how we 
instrument the path is extremely important. Decision branches 
such as “session-in-channel-open” provide the path of what was 
asked for, and logging details at the end of the code path provide 
information regarding what was actually done. In any case, policy 
can be written to provide notice if the local site finds any part of 
the execution path objectionable. 


Using the same rationale, the lower half of Figure 2 provides the
same opportunity to audit this behavior in some detail for 
tunneling and port forwarding activity. While not implemented in 
this design, it should be at least possible (though perhaps not 
practical) to access the forwarded data instead of just identifying 
the static forwarding requests. 


The level of logging may seem excessive, but such detail can 
prove to be quite powerful for forensic analysis when combined 
with local site policy. Local site policy - described later in some 
detail - can act on specific session events like tunneling which 
may not be allowed by a center’s usage policy. There is a huge
benefit to be had in identifying the exact execution path of an
attacker. Since it is not unusual for a tool like ssh to be used in a
way which was not foreseen by the security community, we tend
to err on the side of caution.


4. SYSTEM ARCHITECTURE


For the iSSHD architecture, we selected three principles
fundamental to the design and implementation process. If at any 
time one of these principles was in contradiction with the design, 
something was wrong with the architecture. The principles are: 


1. Avoid introducing stability or security problems: We 
need to demonstrate with high confidence that our 
modified version of SSH is just as stable and secure as 
the original code base. 

2. Unchanged user experience: The modified version of
SSH cannot affect the way users interact with NERSC
systems, require a special version of the SSH client or
application, nor remove any existing capabilities.

3. Minimal impact on system resources: System
resources including CPU time, memory, and network
bandwidth are at a premium. Additional demands made
by the instrumented SSH must be insignificant
compared to an unmodified SSH instance.


Based on these requirements, the following choices were made in 
the architecture and development plan: 


1. Use OpenSSH as the code base. OpenSSH has an 
exceptionally good reputation and is already used on the 
multi-user production systems. In addition, we were 
able to add on the Pittsburgh Supercomputing Center’s 
high performance OpenSSH patch set [12]. This 
provides significant gains in terms of bulk data transfer 
performance. 

2. Minimize changes to the code base. As part of the
project we made an active attempt to minimize the 
number of changes to the original code. In addition, we 
chose to use other tools and capabilities rather than 
write them ourselves. An example of this would be the 
use of stunnel [20] rather than attempting to write an 
add on to ssh for our own data encryption. 

3. Decoupled Analysis: Taking our experience from the 
Bro IDS, we chose to fully decouple the analysis from 
the generation of the ssh instrumentation data. To do 
this it was necessary to remove any dependencies 
between the running iSSHD and the back end analysis. 
This is done by making all writes to the back end non-blocking,
stressing that a failure of the analysis
infrastructure should result in the loss of security data
rather than an interrupted user experience.
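
The third choice can be illustrated with a short sketch. This is Python rather than the C of the actual iSSHD, and the function name send_event is hypothetical. It shows only the general idea: a write to the back end either succeeds immediately or the event is dropped, so a stalled analyzer can never stall the user's session.

```python
import socket


def send_event(sock, line):
    """Hand one event line to the back end without ever blocking.

    sock must already be in non-blocking mode. Returns True if the whole
    line was queued and False if it was dropped. In this sketch a short
    write counts as a drop; real code would also have to deal with the
    truncated line left in the stream.
    """
    data = line.encode("utf-8") + b"\n"
    try:
        sent = sock.send(data)
    except (BlockingIOError, ConnectionError):
        # Back end is slow or gone: lose the security data, not the session.
        return False
    return sent == len(data)
```

With a socket set up via sock.setblocking(False), send_event returns promptly even when the receiving side has stopped reading.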


The overall design of the iSSHD can be broken out into two
sections — the event generation within the running iSSHD process, 
and the logging and analysis that compares those events against 
local policy. Much of §3 was involved with the thought process 
that took place before the coding started. With that in mind, we 
turn to the actual design and implementation of the system itself. 


It should be noted that the core of the analysis side currently exists 
as a log repository with scripts feeding live data to the Bro IDS. 
The use of Bro is not technically required since the file exists as 
structured text, which provides the ability to feed the information 
to any other tool. We will assume for the remainder of the
paper that Bro will be used.


4.1 Server Side

The iSSHD server is modified OpenSSH code that provides
events for further logging and analysis. Within the SSHD 
application (as described in §3.2) there are ideal locations where 
we extract information about user activity. Such information 
includes login and authentication data, session and channel 
creation, port forwarding, and keystroke/application data. This 
data is normalized in terms of data types as well as being formed 
into structured text. This text is then written to a local socket 
(provided by stunnel) using a non-blocking descriptor. Details of 
this process follow. 


For events, a number of data types are defined. Not unexpectedly
these types map approximately to the native data types defined
by Bro. This includes the usual integer, string and count as well
as more network specific types like address and subnet. In order
to encapsulate arbitrary data, both unstructured string and binary
data are URL encoded using the stringcoders library [8]. This
mechanism is used in reproducing user activity since even simple
terminal sessions include Unicode characters and colors. An
additional benefit of URL encoding is to safely encapsulate traffic
that might be directed toward either the analysis system or the 
terminal session of the individual doing the analysis. Original 
versions of the instrumentation attempted to remove non-printing 
characters from the recorded data, but information loss and textual 
confusion ultimately pointed toward the URL encoding solution 
as a better option. 
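
The effect of URL encoding on terminal data can be seen in a small example. The sketch uses Python's standard urllib.parse as a stand-in for the stringcoders C library used in iSSHD, and the helper names are ours. An ANSI color escape sequence and a newline become plain printable ASCII, and the original bytes can be recovered exactly.

```python
from urllib.parse import quote, unquote_to_bytes


def encode_field(raw):
    """Percent-encode arbitrary bytes so the field is printable ASCII."""
    return quote(raw, safe="")


def decode_field(text):
    """Recover the original bytes from an encoded field."""
    return unquote_to_bytes(text)


raw = b"\x1b[31mls -l\n"      # escape sequence for red text, then a command
encoded = encode_field(raw)   # "%1B%5B31mls%20-l%0A"
restored = decode_field(encoded)
```

Because the escape byte never reaches the analyst's terminal in raw form, recorded sessions cannot manipulate the display of the person reviewing them.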


As has been already described, the most basic unit of information 
provided by iSSHD is called an event. Events, as their name
implies, are single actions or decisions made by a user that are 
agnostic from a security analysis perspective. Lines typed by the 
user as well as logins and channel creations are all examples of 
events. 


For event creation, all activity points to a single function. This 
reduces confusion and creates a single point for information 
gathering. A sample function call looks something like: 


s_audit("channel_new", "count=%d count=%i
uristring=%s", found, type, tlbuf);


The function s_audit is the general event handling operation
within iSSHD. It takes three sets of arguments:
the first is just the event name (in this case “channel_new”). The
second defines data typing for the Broccoli interpreter and has
printf() type structure. Any additional arguments define the data
associated with the event type. Here, ‘found’ is the index for the
free channel slot, ‘type’ defines the type/state of the channel (e.g.,
SSH_CHANNEL_LARVAL, SSH_CHANNEL_AUTH_SOCKET), and
‘tlbuf’ is the URL encoded channel name such as server-session
or auth socket. After passing through the Broccoli interpreter, an
event named “channel_new” will be created with three arguments.
Note that there is no indication that the channel creation is 
considered a good or bad thing — such a determination will be left 
to the analysis side of the iSSHD. 
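
The pairing of a typed format string with its arguments can be illustrated with a minimal Python sketch. This is not the iSSHD C code: the function name format_event, the upper-casing of the event name, and the use of Python's quote_plus in place of the URL encoding library are all assumptions made for illustration, and the integer channel types are placeholders.

```python
from urllib.parse import quote_plus


def format_event(name, spec, *args):
    """Pair each 'type=%x' token in spec with one argument (hypothetical helper)."""
    tokens = spec.split()
    if len(tokens) != len(args):
        raise ValueError("format string and argument count differ")
    fields = []
    for token, value in zip(tokens, args):
        type_name = token.partition("=")[0]  # e.g. "count" or "uristring"
        text = str(value)
        if type_name == "uristring":
            # Assumption: encode here; in the paper tlbuf is already encoded.
            text = quote_plus(text)
        fields.append(f"{type_name}={text}")
    return " ".join([name.upper(), *fields])


# Placeholder values: slot index 0, channel type 4, channel name.
print(format_event("channel_new", "count=%d count=%i uristring=%s",
                   0, 4, "server-session"))
```

The point of the sketch is only that the type names travel with the data, so the receiving side can build a typed event without knowing anything about whether the activity is good or bad.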

Figure 3: Overall iSSHD Architecture 


Data provided by keystroke logging presents an interesting 
problem in that the content can be of arbitrary length and will 
contain non-printing ASCII characters. To avoid inefficiencies, 
we cache keystroke data in a channel buffer queue using the 
native channel buffer types until a newline character is seen or 
the data volume exceeds a threshold. In situations where too much 
data is generated on the server side (such as large compile runs), 
the value of this additional data is almost zero. To address this, 
we adopted the same idea as used in the network Time Machine 
[7]: specifically, that most security sensitive data and events tend 
to cluster at the beginning of interactive sessions. By 
making the distinction between interactive sessions (where there 
are roughly the same order of magnitude of client initiated data 
events as server) and highly asymmetric connections (with dozens 
or hundreds of server data events per client data event), we can 
avoid excess resource consumption by the iSSHD. This is one 
situation where it was necessary to build logic into the code 
running in the iSSHD. Table 1 provides cutoff values for both 
normal tty channels as well as channels not bound to a tty. For 
non-tty communications, the ratio of printing to non-printing 
characters is also examined to avoid needlessly copying 
binary files. 

Table 1: Default cutoff values for user and server data. 

TTY | Description | Default value
Yes | Max line length or line count for client input between server inputs. | 15 lines, 64k bytes
Yes | Max line length or line count for server input between client inputs. | 15 lines, 64k bytes
No  | Initial sample value (ISV) before determining binary data. | 1024 bytes
No  | Maximum data in total for either client or server inputs. | .5M bytes
No  | Percentage of non ASCII-printing characters, after ISV, allowed for continued sampling. |

For example, if a user (client side) types ‘ls -l’ in a normal tty 
based login, the iSSHD would provide the server echo of ‘ls -l’ as 
well as the next 14 lines or 64k bytes of server side output 
(whichever limit is reached first). The line/byte count is reset every 
time client data is processed. The cutoff values are modifiable at 
compile time and are set somewhat conservatively since the 
assumption is that a large number of iSSHDs feed into 
a single analysis system. 
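
The reset-on-client-input behavior can be sketched as follows. This is a Python illustration under stated assumptions rather than the C logic inside iSSHD: the class name ServerOutputLimiter is hypothetical, "64k" is taken to mean 64 * 1024 bytes, and the cutoff is assumed to stay in force until the next client input.

```python
MAX_LINES = 15          # Table 1: 15 lines between client inputs
MAX_BYTES = 64 * 1024   # Table 1: 64k bytes (assumed to mean 65536)


class ServerOutputLimiter:
    """Decide whether a server-side line is still worth recording."""

    def __init__(self, max_lines=MAX_LINES, max_bytes=MAX_BYTES):
        self.max_lines = max_lines
        self.max_bytes = max_bytes
        self.on_client_data()

    def on_client_data(self):
        # Every client input resets the line/byte counters.
        self.lines = 0
        self.bytes = 0
        self.cut_off = False

    def allow_server_line(self, line):
        if self.cut_off:
            return False
        if (self.lines + 1 > self.max_lines
                or self.bytes + len(line) > self.max_bytes):
            self.cut_off = True   # stay quiet until the client speaks again
            return False
        self.lines += 1
        self.bytes += len(line)
        return True


limiter = ServerOutputLimiter()
recorded = sum(limiter.allow_server_line(b"output\n") for _ in range(200))
print(recorded)  # number of lines kept from a 200-line burst
```

With the defaults, the echo of the command plus 14 further lines are kept from a long burst, which matches the ‘ls -l’ example above.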


4.2 Data Analysis 


Data analysis covers every component other than the iSSHD 
itself. In practice it can be thought of as the stunnel plus the 
bro instance and related policy. 


The stunnel is not particularly interesting in that we are using it to 
transport data from an open file descriptor on the iSSHD side to 
the analyzer host. Since this is just a simple implementation of a 
well-known application, we will focus on the details provided by 
the policy. 


The bro policy is designed to track individual sessions and 
whatever activity is contained within them - normal shell sessions, 
remote code execution or subsystem invocation. Each session is 
defined by the start of the ssh connection and continues through 
any activity until that connection ends. The series of events for a 
routine login looks something like Figure 4 when printed directly 
from the iSSHD. 


Each of these lines represents an event and the data associated 
with it. Policy can be written to trigger on specific events, their 
data, or both. Of obvious interest are a user’s keystroke data and the 
system’s response. Since we have direct access to near real time 
keystroke information, we look for extremely unlikely - and 
highly suspicious - character sequences. These might include 
known toolkit signatures, abnormal root shell prompts for /bin/sh, 
or any other unexpected commands; these trigger the first type of 
alarm. Sets of commands that are individually of little interest 
but suspicious in combination represent the second type of alarm. These two 
categories are defined by two sets of signatures: the first for 
commands or strings worthy of immediate notification, and the 
second for sets of these commands or strings present in the user 
session. 
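
The two signature classes can be illustrated with a small Python sketch. The actual policy is written for bro and is not reproduced here; the patterns, the threshold of three distinct matches, and the class name SessionSignatures are hypothetical choices made only to show the mechanism.

```python
import re

# Hypothetical patterns; the real default policy set is not shown in this paper.
IMMEDIATE = [re.compile(p) for p in (r"unset\s+HISTFILE",)]
COMBINED = [re.compile(p) for p in (r"\buname -a\b", r"\bwget\b", r"\bgcc\b")]
COMBINED_THRESHOLD = 3  # assumed: alarm once this many distinct patterns are seen


class SessionSignatures:
    """Per-session state for the two alarm types."""

    def __init__(self):
        self.hits = set()

    def check(self, line):
        alarms = []
        # First type: a single string worthy of immediate notification.
        for pat in IMMEDIATE:
            if pat.search(line):
                alarms.append(("immediate", pat.pattern))
        # Second type: individually dull commands that add up within a session.
        for pat in COMBINED:
            if pat.search(line) and pat.pattern not in self.hits:
                self.hits.add(pat.pattern)
                if len(self.hits) == COMBINED_THRESHOLD:
                    alarms.append(("combined", sorted(self.hits)))
        return alarms


session = SessionSignatures()
for typed in ("uname -a", "wget http://host.example.com:23/ab.c",
              "gcc ab.c -o ab -m32"):
    print(typed, "->", session.check(typed))
```

Keeping the combined hits in per-session state is what lets the second alarm type fire only when the commands occur together in one user session.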


In order to circumvent logging from the system login() facility, it 
is not unusual for attackers to remotely execute a shell via ‘ssh 
host sh -i’. This style of reconnaissance has become so common 
during hostile activity that we made sure that it could be simply 
alarmed on and all interactive data recorded. To address this, traffic 
on non-tty channels had to be tracked and analyzed since the tty 
invocation is part of the standard unix login() facility. Since data 
on these channels can include binary streams, the ratio of ASCII 
to non-ASCII packets is monitored. If after a pre-defined 
sampling window this ratio exceeds a threshold, further 
monitoring on that channel is dropped. We have experienced 
tremendous success both in logging the remote execution of shell 
binaries and in monitoring commands to and from such 
occurrences. 
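
A minimal Python sketch of the sampling idea follows. It is an illustration, not the iSSHD implementation: the class name NonTtySampler is hypothetical, the test is expressed as the fraction of non-printing bytes seen so far, and the 0.3 cutoff is an assumed value because the default percentage is not given here. Only the 1024 byte initial sample comes from Table 1.

```python
import string

PRINTABLE = frozenset(string.printable.encode("ascii"))


class NonTtySampler:
    """Stop recording a non-tty channel once it looks like binary data."""

    def __init__(self, initial_sample=1024, max_nonprintable_fraction=0.3):
        self.initial_sample = initial_sample          # ISV from Table 1
        self.max_fraction = max_nonprintable_fraction  # assumed value
        self.seen = 0
        self.nonprintable = 0
        self.enabled = True

    def feed(self, chunk):
        """Return True if this chunk should still be logged."""
        if not self.enabled:
            return False
        self.seen += len(chunk)
        self.nonprintable += sum(1 for b in chunk if b not in PRINTABLE)
        # No decision is made until the initial sample has been collected.
        if (self.seen >= self.initial_sample
                and self.nonprintable / self.seen > self.max_fraction):
            self.enabled = False
        return self.enabled


shell = NonTtySampler()
print(shell.feed(b"uname -a\n" * 200))         # interactive text keeps flowing
transfer = NonTtySampler()
print(transfer.feed(bytes(range(256)) * 8))    # binary stream is dropped
```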


SSHD_CONNECTION_START
AUTH_KEY_FINGERPRINT uristring=0x.. uristring=DSA
AUTH_INFO uristring=Accepted uristring=scottc uristring=publickey
SESSION_NEW uristring=SSH2
CHANNEL_NEW count=0 count=SSH_CHANNEL_LARVAL uristring=server-session
SERVER_INPUT_CHANNEL_OPEN uristring=session
CHANNEL_NEW count=1 count=SSH_CHANNEL_AUTH_SOCKET uristring=auth+socket
SESSION_INPUT_CHANNEL_REQ count=0 uristring=auth-agent-req@openssh.com
SESSION_INPUT_CHANNEL_REQ count=0 uristring=pty-req
SESSION_INPUT_CHANNEL_REQ count=0 uristring=shell
CHANNEL_DATA_SERVER count=0 uristring=%0ALast+login:+Sat+Jan++8+14:45:31+2011
CHANNEL_DATA_CLIENT count=0 uristring=exit
CHANNEL_DATA_SERVER count=0 uristring=exit
CHANNEL_DATA_SERVER count=0 uristring=%0Alogout
SESSION_EXIT count=0 count=28221 count=0
CHANNEL_FREE count=0 uristring=server-session
CHANNEL_FREE count=1 uristring=auth+socket
SSHD_CONNECTION_END

Figure 4: Event series for a shell login. 


The final area to explicitly mention is the ability of iSSHD to 
intercept authentication data. When considering our options for 
recording passwords during authentication, we ended up having to 
carefully balance the utility and risk of retaining the data. In the 
context of a forensic analysis, a password might be tremendously 
valuable if used in a legally sanctioned criminal investigation. On 
the other hand having such valuable credential information in the 
logs represents a huge risk in and of itself, even without taking 
into consideration passwords recorded for other institutions by 
users transiting local systems. Ultimately the decision to record 
passwords is left to the local site as a configure time option so that 
it cannot be adjusted without recompiling the iSSHD. Since it is 
not unusual for sites to share lists of known compromised keys via 
their fingerprints, public keys presented for authentication can be 
compared to a list of known bad keys and alarms raised when a 
suspicious key is seen. 
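
The known bad key comparison amounts to a set lookup on normalized fingerprints. The Python sketch below is illustrative only: the helper names are hypothetical, the list format (one fingerprint per line, ‘#’ for comments) is an assumption, and the fingerprints shown are obvious placeholders rather than real keys.

```python
def normalize(fingerprint):
    """Compare fingerprints case-insensitively and without stray whitespace."""
    return fingerprint.strip().lower()


def load_bad_keys(lines):
    """Build the lookup set from a shared list, skipping blanks and comments."""
    return {normalize(line) for line in lines
            if line.strip() and not line.lstrip().startswith("#")}


def is_known_bad(fingerprint, bad_keys):
    return normalize(fingerprint) in bad_keys


shared_list = [
    "# placeholder entries, not real fingerprints",
    "00:11:22:33:44:55:66:77:88:99:aa:bb:cc:dd:ee:ff",
]
bad = load_bad_keys(shared_list)
print(is_known_bad("00:11:22:33:44:55:66:77:88:99:AA:BB:CC:DD:EE:FF", bad))
```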


4.3 Event Details 


As previously suggested, events generated by the iSSHD are 
without any sort of predefined notions of good or bad since it is 
the role of the analyzer to interpret these events. These events can 
be roughly grouped by function, with types auth, channel, session, 
server and sshd. In addition to these, the sftp subsystem also has a 
number of events associated with it. 


The example presented in Figure 4 shows the series of events seen 
in a “normal” login. Two of the most important in terms of 
monitoring and analysis are CHANNEL_DATA_CLIENT and 
CHANNEL_DATA_SERVER. These events provide unfiltered client 
keystroke and server echo/response data. If a user types 
“lz<backspace>s<enter>” you would see “lz%7Fs” from the client 
side and “lz%08+%08s” from the server side in the URI encoded 
data. The characters ‘%7F’ and ‘%08’ are the control characters 
delete and backspace respectively, as can be seen from the 
standard ASCII definitions. Since we assume all user-generated 
data is potentially hostile, we reduce the possibility of accidentally 
interpreting control characters in the process of reading and 
interpreting the data by storing it in an encoded form. 
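
The encoded forms in this example can be checked with Python's standard urllib.parse functions, used here as a stand-in for the encoding library in iSSHD (their output happens to agree for these two strings, which is not guaranteed for every character). The apply_erase helper is a hypothetical, simplified model that treats both delete and backspace as "remove the previous character"; a real terminal moves the cursor instead.

```python
from urllib.parse import quote_plus, unquote_plus


def apply_erase(text):
    """Simplified replay: DEL (0x7F) and BS (0x08) erase the previous character."""
    out = []
    for ch in text:
        if ch in ("\x08", "\x7f"):
            if out:
                out.pop()
        else:
            out.append(ch)
    return "".join(out)


client = unquote_plus("lz%7Fs")       # what the user typed
server = unquote_plus("lz%08+%08s")   # what the server echoed

# repr() keeps control characters from reaching the analyst's terminal.
print(repr(client), repr(server))
print(apply_erase(client), apply_erase(server))
print(quote_plus(client), quote_plus(server))
```

Decoding only at the point of analysis, and printing with repr(), follows the same principle as storing the data encoded: control characters never reach the analyst's terminal by accident.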


Each event also includes timestamp, server id (process ID + server 
hostname + listening port), client id (32 bit random number) and 
interface address list. This information is tracked by the analyzer 
bro policy as a locally unique session identifier - for example 
#12345. This session id will remain constant for any activity 
attached to that user’s session. This event data is omitted from 
Figure 4 (and the other session figures) for clarity. 
Additionally, data is maintained for the channel id, so session #12 
might contain channel 0 and channel 1. Since the session object 
holds channel objects, the session id (e.g., #12) is the same and the 
channel identifier will be different. A small number of events, 
mostly connected to the running sshd daemon itself, do not have 
all these fields since there is no notion of client session to be had 
when the daemon is starting or emitting a heartbeat event. 
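
One way to picture the session bookkeeping is a table keyed on the (server id, client id) pair that hands out small local identifiers and holds the channels seen for each session. The Python sketch below is an assumption about structure, not the bro policy itself; the class and method names are hypothetical and the server id is represented as a plain string.

```python
from itertools import count


class SessionTracker:
    """Map (server id, client id) to a locally unique session number."""

    def __init__(self):
        self._next_id = count(1)
        self._sessions = {}
        self.channels = {}   # session number -> set of channel ids

    def session_id(self, server_id, client_id):
        key = (server_id, client_id)
        if key not in self._sessions:
            sid = next(self._next_id)
            self._sessions[key] = sid
            self.channels[sid] = set()
        return self._sessions[key]

    def note_channel(self, server_id, client_id, channel):
        sid = self.session_id(server_id, client_id)
        self.channels[sid].add(channel)
        return sid


tracker = SessionTracker()
# Hypothetical server id string: process ID + hostname + listening port.
first = tracker.note_channel("4242:comp05:22", 0x1234ABCD, 0)
again = tracker.note_channel("4242:comp05:22", 0x1234ABCD, 1)
print(first == again, sorted(tracker.channels[first]))
```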


5. RESULTS AND PERFORMANCE DATA 


Presenting quantifiable results for the iSSHD is somewhat 
complicated since there is no control data to base comparisons 
against. Since the number of incidents is not large, and has not been 
checked against a control group or varied across sites, it presents more of 
an anecdotal story than an effective hypothesis test. Using iSSHD 
we have identified approximately three dozen instances of stolen 
credentials. Most of them are not particularly interesting, but at 
the same time we can catch this class of attacker before anything 
can get interesting. Because of this, we will present an unusually 
qualitative analysis for the security and policy enforcement 
capabilities. For performance data we will look at a number of 
measurements comparing iSSHD to an unmodified version 
running on the same hardware. In addition we will also provide a 
simple analysis of aggregate user events that would be extremely 
difficult (or impossible) without the data set. 


Besides detection, the iSSHD provides considerable insight into 
the tactics, skill levels and motivations for many of the attackers 
on our systems. In many cases the forensic logs quickly provide a 
clear indication of the success, skill level and threat presented by 
an intruder. 


5.1 Sample 1: Remote Shell Invocation 

Figure 5 provides a textbook example of a “classic” stolen 
credential and local exploit attack. This user (resu) made the 
mistake of having the same password for at least two sites - 
NERSC and the remote site that was compromised. Here the 
attacker remotely executes a shell to log in, then attempts a local 
Linux exploit. Note that because of the shell invocation, 
communications are not via the normal tty interface - a technique 
detailed in §4.2. 


Details follow with some of the data fields removed for clarity. 


 1  AUTH_OK resu keyboard-interactive/pam
      1.1.1.1:52073/tcp > 0.0.0.0:22/tcp
 2  NEW_SESSION SSH2
 3  NEW_CHANNEL_SESSION exec
 4  SESSION_REMOTE_DO_EXEC sh -i
 5  SESSION_REMOTE_EXEC_NO_PTY sh -i
 6  NOTTY_DATA_CLIENT uname -a
 7  NOTTY_DATA_SERVER Linux comp05 2.6.18-..GNU/Linux
 8  NOTTY_DATA_CLIENT
 9  NOTTY_DATA_CLIENT
10  NOTTY_DATA_CLIENT
11  NOTTY_DATA_CLIENT
12  NOTTY_DATA_CLIENT wget http://host.example.com:23/ab.c
13  NOTTY_DATA_CLIENT gcc ab.c -o ab -m32
14  NOTTY_DATA_CLIENT ./ab
15  NOTTY_DATA_SERVER [32mAc1dB1tCh3z [0mVS Linux kernel 2.6 kernel 0d4y
16  NOTTY_DATA_SERVER $$$ Kallsyms +r
17  NOTTY_DATA_SERVER $$$ K3rn3l r3l3as3: 2.6.18-194.11.3.el5-perf
18  NOTTY_DATA_SERVER ??? Trying the F0PPPPppppp_ m3th34d
19  NOTTY_DATA_SERVER $$$ L00king f0r kn0wn t4rg3tz..
20  NOTTY_DATA_SERVER $$$ c0mput3r 1z aqu1ring n3w t4rg3t...
21  NOTTY_DATA_SERVER !!! u4bl3 t0 f1nd t4rg3t!? W3'll s33 ab0ut th4t!
22  NOTTY_DATA_CLIENT rm -rf ab ab.c
23  NOTTY_DATA_CLIENT kill -9 $$
24  SSH_CONNECTION_END 1.1.1.1:52073/tcp > 0.0.0.0:22/tcp

Figure 5: Remote shell invocation example. 


We can see a number of clear indicators that something is going 
on which is not normal user activity. First is the interactive 
session on a non-tty channel created by remotely executing a shell 
(lines 3-5). Second are the unset HISTFILE command and the 
creation of a directory called “...” under /dev/shm (lines 8-10). 
Finally the exploit is downloaded, compiled and (unsuccessfully) 
run (lines 12-21). Highlighted text represents commands and 
output that, as part of the default policy distribution, are considered 
sufficiently unusual or dangerous to warrant alarming on. 


5.2 Sample 2: Cluster Reconnaissance 

This example is one of the more complex and educational that we 
have captured, providing a clear snapshot of the methodology and 
tactics taken by a pair of hackers looking into our systems. Since 
they are sharing a common login via the GNU screen utility we 
can see the interaction between them and get an understanding of 
their methods and communication, something quite difficult under 
normal conditions. While there are several thousand lines of 
interaction from the event, space limitations force us to only 
include a small chunk of the most interesting (and amusing) lines. 


DATA_CLIENT /sbin/arp -a
DATA_SERVER b@n:~> /sbin/arp -a
DATA_SERVER comp05 (192.168.49.94) at 00:00:30:FB:00:00 [ether] PERM on ss
DATA_SERVER b@n:~>
DATA_CLIENT oh wow
DATA_SERVER b@n:~> oh wow
DATA_SERVER b@n:~> /sbin/arp -an |wc -l
DATA_SERVER 9787
DATA_CLIENT rofl hax it hacker
DATA_SERVER b@n:/u0> sorry, im gonna s roll a cigarette and smoke it, y
DATA_SERVER b@n:/u0> then im gonna come back and try to hack ok ?
DATA_SERVER b@n:/u0> i am gonna go for one
DATA_SERVER b@n:/u0> you cant smoke inside? terrible
DATA_SERVER b@n:/u0> its f cold as f***

Figure 6a: Initial communication. Note: additional server fields, 
time and session id have been removed. 





The text from the screen session is marked in blue, and event 
names are once again bolded. The overall behavior can be 
broken out into several sections. In Figure 6a, lines 1-10, arp 
tables are used to identify locally attached systems. In this case 
the large number of them (9787) seems to call for a 
few moments of thought about how to proceed. This is one of the 
initial indicators that the attackers are not just blindly running 
tools. It also indicates that they are probably in the western 
hemisphere. 


DATA_CLIENT hmm cd ;ssh-keygen -t
DATA_SERVER b@n:~/.ssh> hmm
DATA_SERVER b@n:~/.ssh> cd
DATA_SERVER b@n:~/.ssh> ssh-keygen -t dsa
DATA_SERVER Gen pub/private dsa key pair.
DATA_CLIENT ls
DATA_SERVER b@n:~/.ssh> ls
DATA_SERVER id_dsa id_dsa.pub known_hosts
DATA_CLIENT cat id_dsa.pub > authorized_keys
DATA_SERVER b@n:~/.ssh> cat id_dsa.pub > authorized_keys
DATA_CLIENT ssh -oHashKnownHosts=yes 192.168.0.1
DATA_SERVER b@n:~/.ssh> ssh -oHashKnownHosts=yes 192.168.0.1
DATA_CLIENT cat > ssh_config
DATA_SERVER b@n:~/.ssh> cat > ssh_config
DATA_CLIENT cat known_hosts | grep -v 192.168.0.1. > tmp
DATA_SERVER b@n:~/.ssh> cat known_hosts | grep -v 192.168.0.1. > tmp
DATA_SERVER b@n:/tmp> what are you trying to do get ride of t pressing yes?
DATA_SERVER b@n:/tmp> clearly
DATA_SERVER b@n:/tmp> lol set known_hosts to dev null n00b
DATA_SERVER b@n:/tmp> that is such a hack and completely improper
DATA_SERVER b@n:/tmp> and a good way to lose a box if you forget to remove it
DATA_SERVER b@n:/tmp> nononosec phrack.org done? wn? its in issue 64

Figure 6b: Generate local key pair and populate across NFS 
share, attempt generic NFS type attacks via suid 0 program. 


DATA_CLIENT ps axuw |grep snort
DATA_SERVER ps axuw |grep snort
DATA_SERVER b 36684 0.0 0.0 2740 564 pts/10 S+ 20:59 0:00 grep snort

Figure 6c: Looking for IDS processes. 


By Figure 6b the discussion has indicated a familiarity with insecure 
multi-host NFS file systems - interestingly, they did not attempt to 
use NFSShell. From here (lines 4-5) the pair generate a 
passphraseless ssh key to use across the systems sharing the home file 
system, once again indicating a familiarity with shared file 
systems and how they can be used. They grapple a bit with 
configuration issues and interestingly use the HashKnownHosts 
option to obscure records left in the known_hosts file. Figure 6c 
shows them checking for a running IDS. 


Ultimately this pair logged in to 19 local systems and never 
managed to get root access. The dialog here is as long as it is in 
order to convey the relative sophistication and interesting method 
of the attackers. 




5.3 Performance Data 


There are numerous points of reference in comparing the 
performance of the iSSHD with an unmodified OpenSSH. In this 
case we will be looking at aggregate remote command execution 
time, time to copy binary and ASCII files, CPU usage for general 
activity, and memory usage for the child process. 


This command set is run remotely via remote execution with the 
system time command providing information about total 
execution time, system and user CPU usage. We recognize the 
differences between remotely executing a script containing 
commands and manually running them. Ultimately we chose to 
run via the script for repeatability and ease of use since tools such 
as Expect do not provide additional functionality. 


SSHD  | 42.78 [0.05] | 9.85 [0.11] | 0.70 [0.01]
iSSHD | 43.03 [0.18] | 9.85 [0.15] | 0.69 [0.02]





Table 2: Run time values for three tests, values in seconds, standard 
deviation in brackets. Average remote command execution time 
increases by 0.6%. 


For Table 2 column 1, “Remote Exec” is a set of 13 remotely 
executed commands including normal user activity like ls, touch, 
configure and make. From a simple ratio test, the iSSHD takes in 
total about 0.25 seconds more to run, or about 0.6%. This 
indicates that the average behavior of interactive shell commands 
should not be adversely affected, although small variations in 
keystroke responsiveness could be hidden by the aggregate. Given the 
way that large volume logging is done (as described in §4.1), this is not at all 
surprising. For the additional columns in Table 2, we have the 
time to completion values for using scp to transfer a medium size 
ASCII file as well as a medium size binary file. In this case, 
medium size is on the order of 100MB. In each case the 
additional overhead caused by the memory copy and transmit did 
not produce a significant (or measurable) difference in the 
measured time. In this measurement, the same file was moved 
from one directory on the local system to another 40 times in a 
row. The task was then repeated with the iSSHD to reduce the 
influence of variable overhead and caching. 
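
The ratio test for the “Remote Exec” column can be reproduced directly from the two means in Table 2. The short Python check below uses only those published values; the variable names are illustrative.

```python
# Mean run times in seconds for the "Remote Exec" column of Table 2.
sshd_mean = 42.78
isshd_mean = 43.03

delta = isshd_mean - sshd_mean            # extra time taken by iSSHD
overhead_pct = 100.0 * delta / sshd_mean  # relative to the unmodified SSHD

print(f"difference: {delta:.2f} s, overhead: {overhead_pct:.1f}%")
```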


Looking at CPU usage for the same two data sets demonstrates 
differences in application behavior. First, the system CPU 
dominated the total time by ~ 4:1 for total CPU time per 
transaction. This is not surprising given that the majority of this 
activity is driven by read() and write() calls as well as polling 
during periods of inactivity. 


Figure 7 shows the relationship between execution time and CPU 
time for both sets of test runs. One thing to notice is the slope of 
the linear regression curve. Total CPU usage decreases as 
transaction time increases, since the 
faster you move a constant set of data, the harder the data must be 
pushed during the (shorter) time window. Plotting the product of the two 
terms as a histogram, we see a very tight set of values (s²), 
implying this relationship. 


The final metric is memory use, which ends up being quite 
consistent for both native SSHD and iSSHD when looking at 
results from the data generation scripts. Within SSHD, there are a 
limited number of ways that memory becomes allocated once a 
session completes initialization, the most common being internal 
data buffering and channel creation. In both of these cases the 
size growth is minimal for the modifications made since data 
buffers from interactive sessions are cleared once they are 
written to the stunnel socket. 


[Plot “Total CPU”: CPU (sec) versus Time (sec) for iSSHD and SSHD] 


Figure 7: Total CPU time vs. length of transaction time for test 
data runs against iSSHD and native SSHD. 


The overall conclusion is that the changes made to introduce 
instrumentation into iSSHD do not have a significant impact on 
performance or usability. 


5.4 Overall Observations 

Overall the iSSHD project has provided insight into approximately 
three dozen compromised user accounts since 2009. In each of 
these cases it was possible not only to quickly determine the 
success of the attack, but also to obtain the exploit tools and code used. 


[Plot “Max Channel Count/Session Per Day”: Max Channel Count versus Time (days)] 


Figure 8: Distribution of maximum channels/session for November 
2010. 


As suggested in the introduction, the iSSHD also provides a 
tremendous source of measurement data. We have not yet 
begun to fully explore this avenue, but there is no technical reason 
why we could not use this to identify needs for the user 
community. An example of this would be to systematically 
explore port-forwarding behaviors to see if we could deliver 
network services differently. Besides problem solving, the 
measurement data can also provide an interesting repository of 
pure research data. Figure 8 provides an example of the 
maximum channel count per session per day during November 
2010. It is interesting to note that some users are exceeding 50 
channels per session; in this case the majority of this is web 
browsing. This might be done (for example) to visit social 
networking sites blacklisted by a user’s local institution. This has 
interesting security repercussions to be sure. 
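
A measurement such as the one in Figure 8 can be derived from channel creation events alone. The Python sketch below shows one reading of "maximum channel count per session per day": for each day, the largest number of distinct channels opened by any single session. The input tuples and the function name are hypothetical; the real data would come from the event stream.

```python
from collections import defaultdict


def max_channels_per_day(events):
    """events: iterable of (day, session_id, channel_id) tuples."""
    channels = defaultdict(set)
    for day, session_id, channel_id in events:
        channels[(day, session_id)].add(channel_id)
    result = defaultdict(int)
    for (day, _session), seen in channels.items():
        result[day] = max(result[day], len(seen))
    return dict(result)


# Hypothetical sample: two sessions on day 1, one session on day 2.
sample = [(1, "s1", 0), (1, "s1", 1), (1, "s2", 0), (2, "s3", 0)]
print(max_channels_per_day(sample))
```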


6. FUTURE WORK 


Since the iSSHD is relatively new, there is a great deal of learning 
going on with regard to what information is useful as well as 
available. There are several areas that we are actively looking 
into for future releases. The first is the detection of local terminal 
session hijacking as described in §2 by [17][18]. The second is 
the extraction of keystroke data from the X11 x-terminal data 
stream, which is currently opaque. There is currently some 
prototype work completed for the session hijacking (detailed 
below), while tapping into the X11 stream represents a possible 
way to look into the protocols being tunneled over the ssh 
channel. 


6.1 Local Session Hijacking 

In the available literature and toolkits, there are a number of ways 
that a local attacker can tap into a running session and “reach 
across” the network to access further systems and resources. In 
particular this can be done to elevate privilege if the user has 
gained root access on the external system, or to hop over one time 
password authentication. We are familiar with examples of the 
latter. 


In the SSH-Jack application [17], the attacker attaches to the ssh 
client process with ptrace, finds the channel setup code, then patches the 
memory to request a remote shell attached to a local TCP socket. 
The user running the ssh client is completely unaware that this is 
happening since the new shell runs under a different set of channels 
in the same user session. We are hoping to look for an unusual 
ssh_session2_open() call and match it to the expected state for a 
normal session to help identify this attack. Regardless of this, the 
entire communications from the new channel will be logged and 
analyzed in the same way that normal user activity is. 


A more common attack involves a local root user looking to jump 
off the compromised host through some sort of multi-factor 
authentication. In many cases this involves opening the 
victim user’s terminal descriptors for standard in, out and error, 
then writing data directly into the sockets. The running ssh is not 
even aware that anything is amiss since it is just transiting data 
normally. We are looking to use the Linux inotify interface [2] to 
monitor and log additional file open events on the terminal’s file 
descriptors. This is still in its prototype phase. 


7. CONCLUSION 


We have presented an instrumented version of the OpenSSH 
application that allows for a local site to log and analyze user 
activities on local HPC resources. This analysis can be used to 
enforce local security policy with respect to SSH usage, which 
would otherwise be difficult or impossible with normal tools. 


Appendix 1 

This is an abbreviated list of iSSHD events current as of January 
2011. The event name is in the left column and a summary of 
returned data types is on the right. All events are processed in the 
current public release of the policy set. The description in the 
Returned Data column does not include all the default fields 
described in §4.3. 

Event | Returned Data
auth_invalid_user | userid
      | command, username
      | parent pid, command
      | orig host, orig port, dest host, dest port, session id
channel_notty_analysis_disable | printable/non-printable ratio for non-tty channel exceeds set ratio
channel_notty_client_data | URI encoded non-tty client data
channel_notty_server_data | URI encoded non-tty server data
channel_pass_skip | id of channel where pass skip happened
channel_port_open | path/hostname, remote ip, remote port
channel_portfwd_req | hostname, listening port
channel_post_fwd_listener | listen port, path/hostname, host port, type
      | type, wildcard bind, host, port to connect, listen port
      | type, listening port, path/hostname, remote ip, remote port
channel_socks5 | id, path/hostname, host port, s5 command
sshd_connection_end | remote ip, remote port, local ip, local port, client id
sshd_connection_start | remote ip, remote port, local ip, local port, parent pid
      | local ip, local port
sshd_restart | local ip, local port
sshd_server_heartbeat | listening port
sshd_start | local ip, local port


Abstract 


In May 2011, we embarked on an ambitious course: in 3 
weeks, clear out a small, soon to be demolished research 
datacenter containing 5 dozen research systems spanning 
5 research groups and, along with a new faculty member’s 
systems located off-site, move it all into another 
space suffering from 18 years of accumulated computer 
systems research history. We made it happen, but only 
after intensive pre-planning and after overcoming a number 
of challenges, both technical and non-technical, and 
suffering a moderate amount of bodily injury. 

We present an account of our adventures and examine 
our work in facilities, networking, and project management 
and the challenges we encountered along the way, 
many of which were not primarily technical in nature, 
and evaluate our approaches, methods, and results to extract 
useful lessons so that others may learn from our 
reckless ambition. 


Introduction 


In late 2010, the EECS Department revisited previously 
shelved plans to renovate the north half of the 5th floor 
of the Computer Science building, Soda Hall, including 
demolition of 600 ft² of datacenter space (“530”) shared 
by 5 active research groups and 1 defunct group. Mercifully, 
the new plans left alone a network closet that had 
been originally slated for relocation but proved far too 
costly to move.¹ The demolition schedule shifted a bit 
but, by mid-March, eventually settled down around late 
May/early June with a construction start date of 2 June 
announced in late April. 

Concurrently, we had a new faculty member bringing a 
rack full of research systems from a nearby Industrial Research 
Lab in Downtown Berkeley (“IRB”) that needed 
datacenter space and needed to move by 25 May. We also 
folded this additional, smaller, migration into the overall 
plan. 


With a larger campus and departmental re-examination 
of the utilization of datacenter space, 
we also saw this as an excellent opportunity to clean 
up, reorganize, and plan to upgrade our remaining 
datacenter space. 

Others have examined the topic of large datacenter-scale 
change, most specifically Cha, who looked primarily 
at the computing side of migration, especially 
system configuration, while we tightly controlled the 
amount of system configuration change and were also 
heavily involved in the facilities/physical plant side of 
the migration.[2] Similarly, Cumberland focused exclusively 
on system installation and configuration while our 
systems were already up and running and could not be 
wiped arbitrarily.[4] 


Location Location Location 


In evaluating new locations for migrating systems, we 
considered 5 major characteristics: 


• Air Conditioning (measured in tons²) 
• Power (measured in kVA³) 
• Space (measured in number of racks) 
• Existing Network Access 
• Ease of Moving Systems 

We considered 6 locations for evaluation, but only one 
made sense for relocation: a 1000 ft² datacenter space 
on the fourth floor of Soda Hall (“420A”). It had several 
points in its favor, including the most surplus cooling 
capacity and, after a number of upgrades, power, some 
pre-investment in overhead fiber distribution systems⁴, 
sufficient connectivity to relevant networks, and physical 
proximity to related research groups, staff, and 530. 
Other facilities were either already or about to be filled 
to capacity, poorly suited for experimental systems requiring 
frequent physical access, accessible by too large 
a group of people, or lacked sufficient network access. 

Like all such facilities in the building, both 530 and 420A used raised 
floors with underfloor forced air cooling sharing space 
with underfloor power distribution and legacy network 
fiber runs. They were similar in most ways, with 420A 
being a larger version of 530. 

Table 1 summarizes the nominal capacity and actual 
usage of 530, IRB, and the final destination, 420A. At 
first glance, these numbers seem to indicate that moving 
these systems into 420A would run very close to if 
not right up against the rated capacity on Cooling and 
Power, but the table does not take into consideration the later 
removal of defunct systems and the occasional but generous 
rounding up of usage numbers by Facilities staff. 


Location |       | Cooling (tons) | Power (kVA) | Space (racks)
530      | Rated | 30 (2 x 15)    | 50          |
         | Used  | 10             | 25          |
IRB      | Used  |                |             |
420A     | Rated | 30 (2 x 15)    | 100         |
         | Used  | 13             | 45          |


Table 1: Facilities Utilization Summary 





While the best (or least-worst) choice, 420A still had 
several points against it, all centered upon its age, including 
an 18 year-old raised floor that had never been 
cleaned, 18 years of accumulated cable tangle stemming 
from minimal management of underfloor power 
and legacy fiber distribution, a multitude of circuit types 
instead of a single standard, 18 years of systems research 
history (aka “junk”), and a disorderly mix of standalone 
cage racks and relay racks. 

Though the Computer Room Air Conditioners 
(CRACs), essentially in-room chilled water heat exchangers 
installed in pairs in each room, were nominally 
up to the task of handling the additional load, they 
too were 18 years old and not running at maximum efficiency. 
In fact, the manufacturer sold off that division 
shortly after Soda Hall opened in Fall 1994, and 
we are no longer able to get manufacturer replacement 
parts such as logic control boards; many of these units 
have had custom replacement boards installed by a third 
party vendor. A mix of “ownership” issues (the campus 
physical plant, not the department, manages the building 
HVAC system including the CRACs) and budgeting issues 
(the availability of funding for operational expenses 
versus funding for capital expenses in physical plant’s 
budget) complicate outright replacement. 

Power distribution in both locations consisted of an in- 
room PDU taking 3-phase input feeding under-floor runs 
of both rigid metal and armored flex-conduit (referred to 
as “cable snakes” locally) carrying single phase circuits 
ranging from 15A to 30A and 120VAC to 208VAC. Af- 
ter 18 years of use by a succession of resource-hungry 
computer systems research endeavours (each with its
own power requirements) with minimal efforts to man- 
age the power distribution, “cable snake” became an in- 
creasingly accurate term as the underfloor area devolved 
into an increasingly difficult to manage tangle of flexible 
conduit with many circuits disconnected but left under 
the floor, interfering with orderly air flow and creating a 
maintenance nightmare for the current staff. 

We could address these problems given enough effort,
which we had; time, which we had in short supply;
and funding, which we had but which is complicated to
spend in our academic environment, though eventually
tractable given enough time. For our immediate needs, this facility
badly needed a thorough cleaning, a complete electrical 
survey before making any needed changes, and another 
thorough cleaning — these would be our immediate prior- 
ity while other concerns would have to be dealt with as 
longer-term projects. 


Timing is Everything 


In our academic environment, we try to schedule major 
work involving downtime after May for many reasons. 
Finals, class projects, post-semester research retreats, 
conference/journal submission deadlines, and even VLSI 
tapeout schedules can all key off of the end of the 
semester, but we can engage in major work with relative 
impunity in the brief 2-3 months between the spring and 
fall semesters. Unfortunately, this also applies to major 
construction work, so we had to choose which of March, 
April or May would be the least disruptive time to ac- 
complish this feat. 

We went with May, partly due to circumstance, partly 
by design. Availability of the department electrician and 
a trio of work-study student staff would prove crucial, 
but they were committed to other work until May. Look- 
ing for an upside to this, we found this gave us more time 
to prepare and plan so that, when May rolled around, we 
could spend more time working instead of backing out of 
costly on-the-spot decisions made with little forethought 
or waiting to work because we had not thought some- 
thing through extensively enough. 

This choice had obvious downsides. While the demolition
schedule carried a 2 June deadline, 21 May proved
much more relevant: off-site research retreats and
long-scheduled staff travel filled the last two weeks of
May, and the power shutdown of 530 was scheduled for
25 May to allow for dismantling and disposal of the two
CRACs. This gave us 3 weeks with the entire team
present to do the actual work of prepping the new space
and moving systems. This aggressive schedule came
back to bite us once or twice, but the hard and very 
real deadline proved to be very strong motivation for us 
and gave us greater ability to cut through bureaucracy — 
pushback from those outside our team or claims that we
“could just get an extension” were rebuffed with a re- 
minder that contractors were arriving on 2 June and that 
delay would hold up a major construction project, and, if 
necessary, an invitation to discuss the matter with the de- 
partment chair. This, however, only happened once with 
a colleague who was unaware of the entire scope of the 
work and was quickly handled. 

Our faculty, in particular, left us alone to do this work 
and did not question the migration schedule. While they 
already trust us in general on operational matters, they 
had specifically been informed of this work by the de- 
partment chair and the facilities director to prevent ex- 
actly these sorts of questions — there was faculty buy- 
in to this adventure before it even became our concern. 
While we may not have particularly wanted to go down 
this road, at least this road had already been paved for 
us. Of the 2 other active groups who had systems in 530, 
one group would be moving into the renovated space and 
the other had only a small handful of systems and did not 
mind much as long as the systems were back up even- 
tually — we did not anticipate nor receive any pushback 
from these two groups on the migration schedule. That 
we received no pushback on schedule from faculty or 
users and almost none from staff still amazes us. 


People Get Work Done 


One immediate challenge we faced was that we have no 
staff dedicated to Datacenter Management or who have it 
as a primary job responsibility. Instead, we have a num- 
ber of staff who do work in datacenters as one aspect of 
their jobs, whether they be systems administrators, fa- 
cilities managers, or network administrators. Most no- 
tably, the systems administrators responsible for com- 
pletion of this project all work directly for 3 of the 5 
research groups affected by this move — the other 2 re- 
search groups did not have systems support staff to con- 
tribute to the overall migration. 

Our team consisted of one Department Facilities Di- 
rector, one Department Electrician, three systems ad- 
ministrators with deep institutional knowledge, one new 
guy who started in May, and three work-study student 
helpers. Missing from our merry band was a network 
administrator to handle the myriad of network changes 
needed for all this — primarily a number of changes in 
VLAN assignments. The staff member most familiar 
with the network involved in this migration had taken 
another position elsewhere on campus in February 2011 
and left behind another network administrator unfamil- 
iar with both the overall topology as well as the platform 
specifics. This hole would come to haunt us later and 
nearly derailed the migration schedule. 

Close ties developed over the past decade between
team members proved vital to success given our tight
timeline — having to go “through channels” for change 
requests and having to continually re-establish a shared 
terminology would have crippled our ability to move 
quickly on a tight schedule. We did, however, observe a 
marked discrepancy in the preferred manner of commu- 
nication — while we all relied on 1-to-1 in-person com- 
munication for low-latency high-bandwidth communica- 
tion, we never converged on a single mailing list, wiki, 
or any other particular form of asynchronous collaborative
“hivemind”. Periodic synchronous standup meetings
proved to be the only consistent way to keep us all on the
same page.


Goals 


With limited staff and time, we had to keep our goals 
modest while at the same time ensure that we allowed for 
future facility improvements and upgrades. As noted, we 
had already decided not to engage in major upgrade or 
reorganization work in 420A. We also had to decide how 
much support to give to systems that did not belong to 
our faculty but instead belonged to the two groups with- 
out systems administrators. 

In the end, we settled on providing a minimum base- 
line level of service of rackspace, power, and networking 
for all systems migrating out of 530 but only systems 
that belonged to our faculty received hands-on support 
from us. We convinced the Department IT Director to 
take responsibility for systems that belonged to a defunct 
research center whose sole faculty member had retired 
years earlier. Of the two groups lacking systems admin- 
istration staff, one chose to hire the department’s User 
Support Group to handle the hands-on migration work 
while the other gave the work to one of their undergrad- 
uate interns. 

For our own faculty’s systems, we settled on 3 service 
guarantees: 


• Max of 1 downtime/system
• Max of 1 day/downtime
• Minimize impact on deadlines
  - do not move everything at once


We aimed to have systems back up within 24 hours 
after taking them down for migration and, once we said a 
system was back up, for it to stay up barring user needs or 
“normal” routine operational needs such as periodic OS 
patching. We particularly wished to avoid taking systems 
down again to move after announcing that a system had 
been moved and was back up. 

The last goal proved to be the most complex but also 
the one most beneficial to us. We decided early on that 
moving everything all at once in one fell swoop was far 
too disruptive to our users and risky for us if we took a
wrong step, so we instead chose to move systems based
on when relevant user populations needed them — moves 
more than 2 weeks before a deadline or the day after a 
deadline were acceptable, but not during the two weeks 
before a deadline. This would ultimately benefit us as 
we could pick and choose unused and therefore less crit- 
ical systems to move first in order to test the waters after 
which we could move systems in larger groups. 
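
The deadline rule above can be written down as a simple date check. The following is only a minimal sketch: the function name `move_allowed`, the 14-day reading of "two weeks", and the sample deadline are assumptions made for illustration and are not taken from our actual schedule.

```python
from datetime import date, timedelta

# Assumption: "the two weeks before a deadline" means the 14 days leading
# up to the deadline, including the deadline day itself.
BLACKOUT = timedelta(days=14)


def move_allowed(move_day, deadlines):
    """Return True if move_day falls outside every pre-deadline blackout."""
    for deadline in deadlines:
        if deadline - BLACKOUT <= move_day <= deadline:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical deadline for one user population.
    deadlines = [date(2011, 5, 20)]
    for day in (date(2011, 5, 3), date(2011, 5, 10), date(2011, 5, 21)):
        print(day.isoformat(), move_allowed(day, deadlines))
```

A check like this only says when a move is permissible; choosing which permissible day to use still depended on talking to the users involved.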

For the new faculty member’s IRB systems, we fo- 
cused on: 


• space for racked systems
• installation of electrical circuits
• transport on or before 25 May


We were not immediately concerned with getting the 
systems from IRB up and running, only with making sure 
that there was a location for them and making use of the 
electrician’s time while we still had access to him in May. 

When possible, we allowed our decisions to be guided 
by the pursuit of progress towards a cleaner, better
organized datacenter space, but were prepared to make
well-defined and easily undone short-term decisions in 
order to meet our 21 May deadline. 


Space Planning 


We had actually begun planning for this over a year prior 
when the renovation plans first came across our desks. 
Already dissatisfied with the collection of ad hoc changes 
made to datacenter spaces throughout the building and 
the way that they hampered any growth or reorganiza- 
tion, we all saw this as an opportunity to rip out as much 
cruft, junk, and accumulated history as we could manage 
in whatever time frame we could acquire. 

While we could not muster enough momentum or staff 
time to accomplish significant datacenter cleanup after 
the department shelved these initial plans, the ideas for 
cleanup and upgrade were still fresh in our minds and 
on paper when the department took the renovation plans 
back off the shelf. Additionally, we had already done a 
survey of 530 and 420A in late 2009 as part of a campus
datacenter utilization survey, so we had a good handle 
on who had what systems in 530, how much power they 
consumed, and how much rackspace they needed. 

Initial migration planning started with an overall survey
of 530 to review any major changes since the 2009
survey. We paid special attention to the type of circuits 
we would need in 420A — while individual systems used 
standard IEC 60320 electrical connectors such as found 
on typical PC systems, our in-rack power distribution 
used a variety of means to connect to building power 
including 4 different NEMA twist-lock connectors. We 
conducted a similar general survey of 420A to confirm
general impressions of the space and to note things that 
we could correct before May without the assistance of 
the electrician or the need for the trio of work-study stu- 
dents. 

We briefly entertained the notion of installing an over- 
head busbar power rail system, as is now increasingly 
common in new facilities on campus, but quickly placed 
it on the “needs time and money” list. Within the time 
constraints we had, particularly the electrician’s avail- 
ability, we had to make do with more limited incremental 
changes to the existing underfloor power distribution sys- 
tem. Overhead power would become one of many rec- 
ommendations that we would make for future datacenter 
upgrades. 

We also held off on any major upgrade work to the 
air conditioning system, again for reasons of time. In 
place of capacity upgrades, we pursued two alternatives. 
First, the Facilities Director ran two day-long experiments,
running 420A with 1 out of 2 CRACs shut down
to see if each of the 18-year-old pieces of equipment
could handle the existing load alone, which they did
without failure. While not a strictly rigorous experiment,
it demonstrated that we could have enough cooling ca- 
pacity given prudent placement of systems and pruning 
of unused or offline systems. It also enabled us to pur- 
sue a second avenue — maintenance service and overhaul. 
The Facilities Director scheduled two maintenance peri- 
ods for each CRAC involving aggressively proactive re- 
placement of worn parts, cleaning of water piping to and 
from the building’s rooftop chilled water supply, and ser- 
vicing of each CRAC’s 3 compressors. Our Facilities 
Director estimates that this restored about 20-30% of the
CRACs’ efficiency, though he notes that building
AC makes exact numbers difficult to obtain — in warm 
months, building AC runs more often, creating a shell of 
cooler rooms surrounding 420A while in cooler months 
overall need for heating is rare due to the effectiveness of 
the building’s own insulation. 

We reviewed several ways to rearrange 420A, but ulti- 
mately retained the existing arrangement of 4 main rows 
of racks and 1 catch-all row with a more concerted effort
at enforcing hot and cold aisle separation for dense
installations while relegating less dense installations to 
“warm” aisles. We did rearrange rack allocations so that 
projects, which previously had equipment strewn across 
various disparate racks due to the floorspace equivalent 
of disk fragmentation, could benefit from physical prox- 
imity, thus alleviating network fiber distribution com- 
plexity as well as strengthening project and group iden- 
tity. In addition, we identified equipment from prior 
projects which had been abandoned in place and was el- 
igible for reuse or salvage. 

Once we established an initial rack-by-rack layout, 
mapping research groups to racks, we worked out which
systems would go into which rack and ultimately came 
up with a rack-unit (RU) by rack-unit layout. While we 
could not move any systems until May and knew that 
plans could change in an instant, this gave us a start on 
planning in-rack power and network wiring and allowed 
us to organize systems in a more sensible fashion com- 
pared to the previous “which rack has room?” method. 

After running through a few different ways of sorting
systems into racks, we eventually identified 3 classes of
systems that aligned naturally with their owners and
purposes. This classification also lent itself to organizing
the racks and to answering the inevitable question of
“where do I put this system?”


Experimental research systems 
Our faculty’s systems 
Sat on “Research” network 
Novel Hardware is the point 
Often had two of each kind 


“Production”-ish research systems
Our faculty’s systems 
Sat on “Research” network 
Stable research platforms/services 
Clusters, OS dev, storage 


“EECS” systems
(mostly) Other faculty’s systems 
Sat on “Department” Network 
Group web servers, SW 


Our eventual rack layout would later reflect these 
alignments and would let us make use of some limited 
“luxury” hardware resources more effectively. For in- 
stance, while we did not have the resources on hand 
to put a UPS in every rack, we did have a new-in-box 
UPS that had been bought but was never used for a now- 
decommissioned system — we installed that UPS into the 
Production rack to provide battery backup to a 24TB 
storage system and to some management systems. Sim- 
ilarly, other racks got “intelligent” power strips with re- 
mote outlet control and per-outlet power metering that, 
while not so useful for research quality results, let us 
keep tabs on spikes in power consumption as researchers 
ran experiments. Meanwhile, racks housing systems that 
we expected would see frequent hardware reconfigura- 
tion got an in-rack KVM so we could avoid having to 
un-rack systems just to get a console on them. Most of 
this hardware (cabinets, power-strips, UPSs, KVMs) was 
re-purposed from prior projects. 

As the end of April approached, we began more definitive
preparations. One admittedly sneakier one was to
make daily sweeps of systems in 530 to check for running
user processes and when the last non-staff login
occurred. If no user processes were running and the last
non-staff login was more than a few weeks prior, we
pre-emptively shut the system down. We shut down 15
systems this way and only had to turn one back on; the
remaining 14 stayed powered down until we moved
them at our relative leisure in May.
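
The decision made during each daily sweep can be sketched in a few lines. This is an illustration rather than the script we ran: the function name `should_shut_down`, the 21-day threshold standing in for "a few weeks", and the handling of systems with no recorded non-staff login are all assumptions.

```python
from datetime import datetime, timedelta

# Assumption: "more than a few weeks" is taken to be more than 21 days.
IDLE_THRESHOLD = timedelta(days=21)


def should_shut_down(user_process_count, last_non_staff_login, now):
    """Decide whether a system qualifies for pre-emptive shutdown.

    last_non_staff_login may be None when no non-staff login is on record;
    this sketch assumes such a system counts as idle.
    """
    if user_process_count > 0:
        return False
    if last_non_staff_login is None:
        return True
    return now - last_non_staff_login > IDLE_THRESHOLD


if __name__ == "__main__":
    sweep_time = datetime(2011, 4, 28, 9, 0)
    # Hypothetical hosts: (name, running user processes, last non-staff login).
    hosts = [
        ("hostA", 0, datetime(2011, 3, 1)),
        ("hostB", 4, datetime(2011, 3, 1)),
        ("hostC", 0, datetime(2011, 4, 20)),
    ]
    for name, procs, last_login in hosts:
        print(name, should_shut_down(procs, last_login, sweep_time))
```

How the process count and last login are gathered (for example from process listings and login records) is left out here, since that depends on the operating systems involved.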

Some of the more obvious space preparations included 
basic cleanup of the space — we lost track of the number 
of cardboard boxes we had found squirreled away in ev- 
ery possible corner, nook, and cranny — collection and re- 
moval of abandoned or deprecated systems that had been 
left behind in racks but never tagged as excess, and tag- 
ging of equipment for storage and later repurposing. In 
preparation for the inevitable exodus of unused systems 
from both 530 and 420A that nobody was quite prepared 
to send to the campus “Excess and Salvage” unit, the Fa- 
cilities Director had begun his own cleanup of a large 
basement storage room for our use during the move. 

By mid-April, we had a good handle on the work 
needed to ready 420A — outside of a fair amount of 
hands-on physical labor, we saw no major facilities ob- 
stacles to having space ready for move-in during May, 
leaving only networking as a major concern.


Network Planning 


We had three areas of concern regarding the network
changes needed to support this move: the changes 
needed, who would do the work, and, as usual, the time 
available. As with planning for other parts of this move, 
we chose to stick with minimal changes instead of an 
overly ambitious redesign. 

The systems moving out of 530 were spread across
two distinct networks. The “Department” network,
managed and funded centrally by the department,
provided general commodity network connectivity
throughout the department. The “Research” network,
funded by research grants and donations and managed by
research systems support staff until early 2011, evolved
out of a wider network deployed for a campus-wide
clustered computing project to serve the needs of
specialized computer systems research in the Department.
The vast majority of systems belonging to our faculty sat
on the Research network while the dozen or so systems
that belonged to other faculty sat on the Department network.

This presented one small but immediate problem. 
While 420A historically supported systems associated 
with clustered and distributed computing research — the 
projects involved in such research provided their own 
network hardware to support the higher density network- 
ing they required — the Department network had very lit- 
tle presence in 420A at all, no more than a dozen ports 
available via network drops pulled in from a nearby net- 
work closet that were meant for one-off systems, not 
for higher density installations. Department networking
staff did not have any spare equipment to install a
managed switch to support denser installation in 420A of
systems on the Department network, but fortunately
research networking had enough spare equipment to loan
out a switch which someone would set up as a managed
switch attached to the Department network. The
last detail about it being used as a managed switch 
would change, but this plan would remain otherwise un- 
changed. 

A larger question was what direction to take the Re- 
search Network’s presence in 420A. The Research Net- 
work’s presence in 420A, once extensive to support large 
clustered computing projects, had itself dwindled in size 
as systems’ power and cooling density rose far faster 
than their space and networking density and by this time 
had evolved into a few network stubs supporting smaller 
projects. 

One such stub, a group of 4 daisy-chained switches at- 
tached to the Research Network via a lone 10Gb/s link 
over long-range fiber, was on the right VLANs which 
opened up the possibility of just daisy-chaining even 
more switches. We had previously discussed plans to 
stem the growth of the Daisy Chain of Doom (DCOD) 
but lacking sufficient spare long-range optical modules 
for a second long-range fiber run, extending the DCOD 
was straightforward, predictable, and cheap since we did 
have plenty of switches and short-range optical modules. 
We understood the downsides of relying on what would 
turn out to be a daisy-chain of 7 switches, but felt that it 
would be acceptable for the short-term (6-9 months) un- 
til we could spend enough time planning more extensive 
changes. 
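
The downsides we accepted can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes the single uplink attaches at position 1 of the chain, that a failed switch strands everything behind it, and that uplink bandwidth is shared evenly, none of which we measured. The function names are hypothetical.

```python
def stranded_switches(chain_length, failed_position):
    """Count switches cut off from the uplink when one switch fails.

    Positions are 1-based, with position 1 holding the single uplink.
    The failed switch itself is counted as stranded.
    """
    if not 1 <= failed_position <= chain_length:
        raise ValueError("failed_position must be within the chain")
    return chain_length - failed_position + 1


def even_uplink_share_gbps(uplink_gbps, chain_length):
    """Per-switch share of the uplink if every switch is equally busy.

    Assumption: an even split, which real traffic rarely follows.
    """
    return uplink_gbps / chain_length


if __name__ == "__main__":
    chain = 7  # length of the extended daisy chain described above
    for position in range(1, chain + 1):
        print(position, stranded_switches(chain, position))
    print(even_uplink_share_gbps(10, chain))
```

The closer a failure is to the uplink, the more of the chain it takes down, which is why we regarded this arrangement as a short-term measure only.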

Regardless of any physical topology changes, we 
quickly realized that there would be a significant number 
of changes to port VLAN assignments to support systems 
with private interfaces on a separate VLAN. This led to 
the biggest question — who was going to do the work of 
reconfiguring the switches? 
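
The scope of such a change request can be worked out before anyone touches a switch by comparing current and desired port assignments. The following is a vendor-neutral sketch; the function name `vlan_changes`, the port labels, and the VLAN IDs are hypothetical and do not reflect our actual configuration.

```python
def vlan_changes(current, desired):
    """Return {port: (old_vlan, new_vlan)} for ports whose VLAN must change.

    Ports missing from `current` are reported with old_vlan set to None.
    Ports present only in `current` are left alone.
    """
    changes = {}
    for port, new_vlan in desired.items():
        old_vlan = current.get(port)
        if old_vlan != new_vlan:
            changes[port] = (old_vlan, new_vlan)
    return changes


if __name__ == "__main__":
    # Hypothetical assignments: VLAN 100 is public, VLAN 200 is private.
    current = {"port1": 100, "port2": 100, "port3": 100}
    desired = {"port1": 100, "port2": 200, "port3": 200, "port4": 200}
    for port, (old, new) in sorted(vlan_changes(current, desired).items()):
        print(port, old, "->", new)
```

A list like this doubles as the written change request and as a checklist for verifying the switches afterwards.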

Prior to 2011, research systems support staff shared 
management access and duties with a lead “Network 
Guy” who himself had worked in our team as a split net- 
work and systems administrator before transitioning in 
2009 to a full-time network management position supporting
both the Department and Research Networks. Upon
his departure, he handed over the Research network to 
the remaining network administrator at the direction of 
the department IT Director who wanted to see both net- 
works managed in a more unified manner. 

We were wary of this change, specifically losing ac- 
cess to manage the research network, but, in a good faith 
attempt to support the IT Director’s direction in network 
management, we went along with his request that we 
send all of our network change requests to the department
network staff, which consisted of the remaining network
administrator and the soon-to-retire infrastructure
services manager. While we expected some communication
and culture differences, we did not anticipate at this
point the delays that would come with this arrangement
and nearly derail the migration.

In late April, we met with the remaining network ad- 
ministrator to go over high level plans and examples of 
the changes we wanted so that everyone was on the same 
page. We held one more meeting to go into further detail 
and left with everyone understanding what we needed 
and when we needed it. At this point, every major task 
was identified and assigned to one or more persons. 


Progress 


Once May started, we gained access to the department 
electrician, work-study student labor, and, much to our 
delight, a large storage room courtesy of the Facilities 
Director for anything and everything we wanted to re- 
move from 420A or 530 but were not quite ready to junk. 
Also joining us on 2 May was our new systems adminis- 
trator who showed up for work right as we began the bulk 
of the work. We all met on 1 May for an initial meeting
to confirm that we were all on the same page, and from 
that point, work progressed quickly. 

Our first order of business was a detailed electrical survey
of 420A so we had some idea of what circuits
were actually live. At the same time, we were tagging
equipment either to go to Excess and Salvage or to
storage for the work-study crew to remove from 420A,
which they did as fast as we could tag it. Within a week,
the electrician had mapped out all circuits in 420A, made 
all of the initial electrical changes that we had requested, 
and disconnected all hardwired powerstrips that had been 
attached to the legacy relay racks. We would later ask for 
a few changes which were quickly handled. Once this 
work was completed, the work-study crew set to work re- 
moving relay racks so we could bring in cage racks from 
storage that we had set up in a staging area with in-rack
power distribution and cable management. 

In the second week of May, we began to bring
networking to individual racks. While we waited for the
network administrator to configure switches to add to the
DCOD attached to the Research Network, we worked on
bringing in more access to the Department Network by
installing a spare L2/L3/L4 switch in the rack we had
set up for the two groups whose systems all sat on the
Department Network.

As we had anticipated, setting up an optical fiber link
back to the Department Network did not work out — we 
only had 10 gigabit optical modules for our equipment 
and the Department network staff did not have spare 10 
gigabit optical modules to support a fiber link to our 
switch, only 1 gigabit modules. Instead, we located a
free copper network jack in 420A, connected the switch
to that, and proceeded to set up a pricey L2/L3/L4 switch
as the functional equivalent of a simple 48-port L2-only 
desktop switch. Though arguably a waste of an expen- 
sive piece of network hardware, this temporary loan al- 
lowed the department networking staff the flexibility to 
later provision a switch from their current vendor, get the 
proper optical modules, and configure it more fully to al- 
low them to bring in multiple VLANs as needed. 

At the end of the second week of May, we were ready 
to let the two groups move their systems into 420A. We 
wrote up instructions on how to get their systems up and 
running in 420A and distributed them the next Monday;
they would be the first people to migrate out of 530. We 
would not be able to move our faculty’s systems out of 
530 until the third week of May due to delays in getting 
our network changes handled. 


Network Lag 


In the second week of May, we experienced slower 
progress with our network requests than we had ex- 
pected. Though we estimated the actual work would only 
take an hour or two, we anticipated some delay due to 
foreseeable factors on the part of the network administrator
such as lack of familiarity with the Research Network,
a strong desire to get everything set up exactly right
for us, and an already busy work schedule. However, it 
became increasingly clear that the delay would be well 
beyond what we anticipated or could handle given the 
short timeframe. 

We first experienced a short delay when the network
administrator pushed back a day on handling our network 
changes; though unexpected, we attributed it to a heavy 
workload due to taking on all day-to-day support for the 
Department Network after Network Guy’s departure and 
felt that we could absorb this delay as we still had some 
facilities work left to finish. 

Though the network administrator had worked with 
and trained on equipment from the vendor used by the 
Research Network, that experience had gone unused for 
a number of years due to the department’s use of a dif- 
ferent vendor. To help overcome this, one of our team 
wrote the switch configurations for the network admin- 
istrator to load onto the switches, saving the network ad- 
ministrator time and transferring responsibility to us if 
something went wrong. 

The final and unmistakable sign came near the end of 
the second week of May when we received notice that the 
reconfiguration of our switches, essentially loading the 
switch configurations we had written ourselves, had been 
reassigned to another member of staff who also had little 
to no recent experience with the network environment. It 
was at this point that we realized that something much
more fundamental was going on here than just lack of
time on the part of the network administrator. At this 
point, we only had a week left with all team members 
present to complete the work — any further delays would 
irreparably derail the migration. 

An hour-long meeting with the IT Director revealed that
he had been unaware of our previous access to manage 
the Research Network and of our relevant expertise and 
that he had also partly based his decisions on inaccurate 
information about relevancy and recency of other staff 
members’ experience. We appreciated his desire to see 
both the Department and Research networks managed 
as a more unified single entity and understood his con- 
cerns that restoring our access to manage the Research 
Network could lead to further divergence, but we could 
no longer tolerate holding up the schedule to accommodate
this nor wait for staff who weren’t familiar with the
systems involved. We reminded him of the lesson from
Brooks’s “The Mythical Man-Month” [1], namely
that adding more staff to a project running behind schedule
[or, as we added, running very close to it], particularly
staff unfamiliar with the project, will cause it to fall
further behind schedule.

By the end of the meeting, we had, albeit with caveats 
about keeping the department network administrator in- 
formed, regained administrative access to the Research 
Network hardware and, with passwords literally in hand, 
we walked back up to 420A and started reconfiguring 
switches. That evening, we were up and running and 
ready to start moving our faculty’s systems into 420A 
that night — on-schedule completion once again appeared 
within reach. In the weeks and months after the migra- 
tion, we would return to the question of why it took so 
long to achieve our network goals, regardless of how we 
achieved them. 


Back to Work 


With our network problems resolved, we returned to the 
exodus from 530. With a few days of work time lost 
to dealing with the network management problems, we 
had to toss out a fair bit of our move schedule and start 
doing a lot more ad-hoc scheduling. Luckily, when we 
first started working out downtime schedules with users, 
instead of just setting specific dates, we also asked for 
“OK” windows, akin to space launch windows, during 
which it would be acceptable, though maybe not optimal, 
to shut a system down with short notice. This allowed us 
to aggressively pursue a more dynamic schedule where 
we checked systems for user processes, quickly checked in
with users about short downtimes that morning or
afternoon, and, barring any objections, moved systems in as
large a batch as we could manage, with downtimes
hovering around an hour for each system. At times, users
would even proactively inform us that they were done
with a system and could bear to be without use for some 
period of time, thus saving us the work of regular polling. 
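
The dynamic scheduling described above amounts to filtering systems by two conditions: no running user processes, and a current time inside a window the users had already approved. The sketch below illustrates that filter only; the function name `movable_now` and the dictionary keys are hypothetical, and the final check-in with users is deliberately not automated.

```python
from datetime import datetime


def movable_now(systems, now):
    """Return names of systems that could be moved on short notice right now.

    Each system is a dict with hypothetical keys:
      name           - host name
      user_processes - count of running non-staff processes
      ok_windows     - list of (start, end) datetimes agreed with users
    """
    batch = []
    for system in systems:
        if system["user_processes"] > 0:
            continue
        if any(start <= now <= end for start, end in system["ok_windows"]):
            batch.append(system["name"])
    return batch


if __name__ == "__main__":
    morning = (datetime(2011, 5, 18, 8, 0), datetime(2011, 5, 18, 12, 0))
    systems = [
        {"name": "hostA", "user_processes": 0, "ok_windows": [morning]},
        {"name": "hostB", "user_processes": 3, "ok_windows": [morning]},
        {"name": "hostC", "user_processes": 0, "ok_windows": []},
    ]
    print(movable_now(systems, datetime(2011, 5, 18, 10, 0)))
```

The resulting list is a set of candidates for a batch, not a decision; we still confirmed with users before shutting anything down.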

By the middle of the third week, we were back on track 
and even a little ahead of schedule as systems were being 
ferried down from 530 to 420A 3-4 times a day. By us- 
ing a few spare and borrowed rack cabinets in 420A, we 
were able to more easily stage the migration of systems 
out of 530. The alternative of moving whole racks at a 
time (versus opportunistic migration of systems) would 
have been more challenging and would have allowed less 
opportunity for the reorganization of systems in 420A. 

At the same time, we had managed to carve out time 
for one team member and the electrician to make a field 
trip down to Downtown Berkeley to confirm details of 
power and transport of the new faculty member’s sys- 
tems from IRB to EECS. By the end of the third week, 
we were down to just a few systems and lots of supplies 
and equipment left in 530 that were migrated to 420A or 
sent to storage. That weekend, one team member went 
out of town to handle site prep for an off-site research re- 
treat followed by long-scheduled vacation and would not 
return to campus until two weeks later. While the IRB 
systems arrived on 25 May, we consider the day of completion
for the migration to be 24 May, which coincided
with the first day of work for a newly hired computing 
Infrastructure Services Manager. 


Lessons Learned 


We learned a lot from this adventure. The one thing 
agreed upon by everyone involved as well as by several
outside observers is a general sentiment of “Never.
Again. Ever.” This was probably near the limit of what
could be accomplished with the resources we had avail- 
able — and we came very close to failing. While we had 
options available for every problem we ran into, in some 
cases, some of those options had worse long-term con- 
sequences than failure. We got by on our own resource- 
fulness, persistence, and sheer luck to make up for a lack 
of room within which to fail. We hope that, by showing 
what happens when one runs a migration with the bare 
minimum time and staffing, others may take heed and 
step back from the proverbial cliff’s edge that we danced 
upon for a period of a month. 

A key overall lesson we kept coming back to was 
“Plans are just that — plans.” The more time one has, the
more one can try to bend reality to a plan — the less time 
one has, the more one must bend plans to fit reality. We 
lost track of the number of times some trivial problem 
arose that did not fit into our plan — like a supposedly 
42RU high rack turning out to be 41.75RU high — in- 
stead of losing sleep over it, we found ways to adapt, like 
we did with the shorter-than-advertised rack by moving 
a server to another rack despite it not completely lining
up with how we organized systems into racks. By look- 
ing at our plans more as initial guidelines or a starting 
point, we gave ourselves the freedom to deal with small 
unforeseen problems without getting hung up about them
and to look at larger problems not as problems but as 
course corrections. 

The only truly disruptive problem was the delay in 
getting network changes made. While the problems we 
faced are getting a more formal review now that the new 
Infrastructure Services Manager is settled into his job, it 
is clear we had problems with communication, aware- 
ness of staff skillsets, and what could best be described 
as differences in culture. Others in the department have 
cited lack of project management, formal or otherwise, 
as a skill not explicitly present in our team. Though our
Facilities Director served as de facto project manager to
keep track of progress and did communicate our problems
to the IT Director, we found that the IT Director did
not understand the full situation until we had our own
sit-down meeting with him. More explicit project management
would likely have caught and handled these problems
earlier. For our part, it is fair to say we could have
been more direct and explicit about when we needed this 
work done and could have proactively listed restoration 
of our administrative access to the Research Network 
hardware as an alternative if a deadline were missed. 


The question of “What would you have done if you 
had not received access?” has come up. We find it hard 
to believe that the IT Director would have said, “No.” af- 
ter being presented with the facts, but from a technical 
standpoint, it would have been easy — with physical access
to the hardware, we could have gained the needed access
via brute force methods.[3] This would have required 
disruptive power-cycling of each switch, but we could 
then do what was needed. From a management stand- 
point, it would have been contentious at best. At the very 
least, we would have informed our faculty and the Fa- 
cilities Director of the problem and what we planned to 
do. It likely would also have perpetuated an image of us, 
whether rightly so or not, as “wild cannons” with little 
respect for authority and process who were difficult to 
work with and ultimately would have contributed to an- 
tagonistic relationships with our colleagues and our man- 
agement. This was not something any of us would have 
considered casually. 

Though not disruptive to the migration, still unfortu- 
nate were the moderate injuries suffered by staff. One 
team member, while working alone after hours with the 
raised floor, bashed his knee on two occasions, result- 
ing in a bruised kneecap that took two months to heal 
completely. Had he suffered more serious injury, it was 
possible no one would have noticed until the next morn- 
ing. His time working late would have been better spent 
planning and preparing for the next day’s work or even
sleeping. Another staff member from Shipping and Receiving
broke two fingers while moving a cage rack that was still
loaded onto a pallet into an elevator. Though
trained in use of pallet loaders, he was unaware that racks
shipped on pallets often come with ramps to aid in unloading
the rack to avoid having to move the entire package
like that — while most systems administrators in the de- 
partment have had to deal with this and were aware of 
this feature, we wonder how widespread this knowledge 
is among facilities staff outside of those who work exten- 
sively in datacenters. 

While attributable to the rapid pace and aggressive 
schedule of the migration, these injuries were avoidable,
indicating that at times some team members pushed
themselves too hard or that they should have taken a 
more active role in physical tasks delegated to other staff 
to share our knowledge regarding better methods. 

Our team took to heart the first lesson from “The 
Mythical Man-Month” about bringing on new team 
members to quick-paced or behind-schedule projects. 
All members had known each other for at least a few years,
had worked together on other projects, and were familiar 
with each other’s habits as well as with the institutional 
knowledge of how things worked in the department. We 
did not bring our new systems administrator onto the mi- 
gration project until after the majority of the facilities 
work had been finished and we could give him tasks that 
could be done jointly with someone else. We consider 
the case of assigning network configuration tasks to staff 
unfamiliar with the environment to be an example of the 
“Mythical Man-Month” fallacy that Brooks describes. 

Similarly, we confirmed what might be considered an 
obvious notion — that fast-paced schedules are no place 
for in-depth staff training. When the schedule is fast, and 
the room for slack is tight, this is no place to be bringing
in people new to the environment. The training overhead 
of new staff is unaffordably high in addition to the ex- 
tra communication overhead of adding an extra person, 
experienced or not. As with our new guy, we instead 
followed another lesson from Brooks — give more rou- 
tine tasks to new staff to allow more experienced staff to 
tackle the more difficult problems. 

One thing that surprised us but really should have been 
obvious was the lack of convergence on any asynchronous
collaborative systems. Use of e-mail lists was sporadic
at best, and only technical staff made any use of wikis 
or other web-based systems. As noted, for overall team 
synchronization, the only consistently successful method 
was a periodic standup meeting. Even among technical 
staff, paper notes left on racks often served better for 
passing short must-read notices about a rack or a system
while things were in flux, whereas a wiki appeared
better suited to noting the steady state once
a system or rack had been successfully moved into place
and deemed stable. This is an extension of our previous lesson —
once a project starts, training project members on
new systems is inadvisable at best unless the benefit is 
so large compared to the training overhead that failure to 
adopt the new system could result in project failure. 

The lessons we learned could best be summed up as 
“More time, better communication, and slow down.” The
ambitious plan to completely clean up the target room 
(420A) was cut short due to lack of time. While the result 
was a vast improvement, the time required to perform the 
cleanup was underestimated. In addition, it was unclear 
at the outset what parts (shelves, bolts, power strips, ca- 
bles, etc.) might be needed for the final configuration. 
An earlier start on the clean-up would have helped; at 
least more of the parts could have been sorted in advance 
for subsequent disposition. 


Future Work 


While we were able to accomplish our basic goals of 
moving all systems from 530 to 420A in the time avail- 
able, we still have a great deal of work left to bring 420A 
and other similar facilities in the department up to more 
contemporary standards. Some of this work is tractable 
in the near-term while other work will remain the focus 
of longer-term efforts involving questions of funding and 
staffing. 

Our short-term work will focus on continual efforts at 
clean-up in all facilities. The one constant we have found 
about any datacenter facility, especially in an academic 
research environment, is the tendency for old equipment 
to accumulate and linger around so long that people for- 
get what a piece of equipment is for and, as a result, be- 
come afraid to get rid of it. The one constant we have 
found about buildings with both raised floors and drop 
ceilings is the tendency for those areas to become abso- 
lutely filthy, especially raised floor plenums which cause 
systems to take in large amounts of dust and “biologi- 
cal debris”. Both situations require both immediate and 
ongoing attention to mitigation. We are currently evalu- 
ating in the short-term a number of different options for 
front-of-rack filters. 

Mid-term work focuses on networking, in partic- 
ular the hack that is the 7-switch-long DCOD cur- 
rently feeding almost all of the research systems in 
420A. Current options include a 16-port 10GbE distribution/aggregation
switch or setup of a double-ended string
of switches. For now, the DCOD suffices, but the physi- 
cal path of the fiber is tortuous at best due to the ad hoc 
manner in which the original stub network in 420A grew 
with fiber criss-crossing the room multiple times. Ad- 
ditional switches will only complicate this further and 
limit growth. Related to that is the replacement of the 
switch currently deployed in a ‘production’ role on the
Department Network. It should at the least be managed
as a fully configured switch instead of being set up as a $6,000
dumb L2 desktop switch, but ideally would be a model 
from the vendor currently used for the rest of the Depart- 
ment Network so that it can be more easily managed by 
department networking staff. Finally, we are very inter- 
ested in finding ways to promote a more unified approach 
to network management that avoids the maintenance of 
two completely separate network domains. 


Longer-term work, on the order of a year or more, includes
several facilities upgrades. The first and foremost,
but likely to take the longest, is the replacement of the
in-room CRACs, which are nearly 20 years old and no
longer supported by their manufacturer. Current mea- 
surements indicate that we are already using approxi- 
mately 80% of our AC capacity while we still suffer a 
compressor failure about every 2-3 months in 420A. Replacement
requires negotiation with our campus Physical
Plant and could take a year if not more, but is necessary
to support future growth. 
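
As a worked example of the capacity figure above, the sketch below converts cooling tons to watts using the definition given in the Notes (1 ton is 12,000 BTU/hr, about 3,517 W) and computes a utilization ratio. The 20-ton capacity and 56 kW heat load are hypothetical values chosen only to land near the 80% figure; the actual capacity of the 420A CRACs is not stated in the text.

```python
BTU_PER_HR_PER_TON = 12_000        # 1 ton of cooling, as defined in the Notes
WATTS_PER_BTU_PER_HR = 0.29307107  # standard BTU/hr to watt conversion


def tons_to_watts(tons):
    """Cooling capacity in watts for a rating given in refrigeration tons."""
    return tons * BTU_PER_HR_PER_TON * WATTS_PER_BTU_PER_HR


def utilization(heat_load_watts, capacity_tons):
    """Fraction of cooling capacity consumed by a given heat load."""
    return heat_load_watts / tons_to_watts(capacity_tons)


if __name__ == "__main__":
    # The ice-based definition: 2,000 lb x 144 BTU/lb spread over 24 hours.
    print(2000 * 144 / 24, "BTU/hr per ton")
    # Hypothetical room: 20 tons of CRAC capacity and a 56 kW heat load.
    print(f"{utilization(56_000, 20):.0%} of capacity in use")
```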


Other work we would like to pursue all relates to power
distribution. The variety in types of circuits installed 
complicated work — standardizing on a single type of 
circuit, say 208VAC at 30A, would greatly ease man- 
agement of the underfloor power distribution system and 
would make it easier to standardize on a single type or 
vendor of in-rack power distribution units (PDUs). It is
already hard enough to find PDUs with certain features
(per-outlet power monitoring, metering, and control;
reasonably secure remote access; and a usable API for
developing our own applications). Trying to account
for even a handful of circuit types makes this impossible
given current vendor offerings and impedes efforts to
gain better insight into power usage. We could pursue
this work on a piecemeal basis, but we wonder if that 
would result in less standardization. Potentially the most 
extensive work we would like to pursue is transition to an 
overhead power distribution system. Though obviously 
costly, this would yield huge benefits in restoring orderly
airflow to the underfloor air circulation space, reducing the
time cost of dealing with numerous circuit types, and
simplifying power usage surveys and audits.
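
To make the appeal of a single circuit type concrete, the sketch below computes apparent power for the 208VAC, 30A example, using the kVA definition from the Notes. It assumes a single-phase circuit (a three-phase circuit would add a factor of the square root of 3), and the 0.95 power factor and circuit count are hypothetical; none of these details are given in the text.

```python
def apparent_power_kva(volts, amps):
    """Apparent power of a single-phase circuit in kVA (RMS volts x RMS amps)."""
    return volts * amps / 1000.0


def real_power_kw(volts, amps, power_factor):
    """Real power in kW, given an assumed power factor between 0 and 1."""
    return apparent_power_kva(volts, amps) * power_factor


def room_capacity_kva(circuit_count, volts, amps):
    """With one standard circuit type, a power survey reduces to a multiplication."""
    return circuit_count * apparent_power_kva(volts, amps)


if __name__ == "__main__":
    print(apparent_power_kva(208, 30), "kVA per circuit")
    print(real_power_kw(208, 30, power_factor=0.95), "kW at an assumed 0.95 PF")
    print(room_capacity_kva(12, 208, 30), "kVA across 12 hypothetical circuits")
```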


On the non-technical side, we look forward to ad- 
dressing clear deficits in key areas such as datacenter, 
network, and project management, communication and 
culture barriers between research and production opera- 
tions, and management awareness of staff expertise along 
with staff awareness of management plans. We expect 
that this work will be ongoing for the rest of our profes- 
sional careers — not because we think that we will always 
have these deficits but because the only way to avoid de- 
veloping these sorts of deficits is by continually working 
to ensure that they do not develop. 


Conclusions


We pulled it off, but just barely. We needed every single 
last day available in order to complete the work necessary
and could have used an extra day or two for breathing
room. We got by on large measures of determination
and dedication, resourcefulness, and sheer dumb luck. 
We look at this work as an accomplishment worth being 
proud of but also as an example illustrating all the things 
that one should have — many of which we did not — in 
order to embark on a similar datacenter migration adven- 
ture with a more reasonable chance of success and better 
options in case of something less than 100% success. 


Notes


1. Rough estimates for network closet relocation ran in the $500K range.

2. 1 ton of cooling is 12,000 BTU/hr or 3,517 W. The refrigeration
and air conditioning fields use this unit to denote the heat required to
melt 1 short ton (2,000 lbs) of ice at 0 °C in one day, representing the
cooling provided by daily delivery of 1 ton of ice.

3. 1 kVA is 1 kilovolt-ampere and is used to measure “Apparent
Power”, the product of root-mean-square voltage and current. “Real
Power”, measured in watts (W), refers to the power actually usable by
devices.

4. This “system” amounts to a grid of overhead Panduit-style fiber
trays that meets up with an in-room fiber termination box fed by a
nearby network closet.

5. This trend would peak in the mid-2000s with the installation of a
128-node Itanium 2 cluster.

6. The switch was part of a large donation, so we do not know how
much it cost the vendor to give it to us, but the current street price with
optics is around $6000.


Abstract


High Performance Computing systems are complex to stand up and integrate into a wider 
environment, involving large amounts of hardware and software work to be completed in a fixed 
timeframe. It is easy for unforeseen challenges to arise during the process, especially with respect 
to the integration work: sites have dramatically different environments, making it impossible for 
a vendor to deliver a product that exactly fits everybody’s needs. In this paper we will look at 
the standup of Cielo, a 96-rack Cray XE6 system located at Los Alamos National Laboratory. 
We will examine many of the challenges we experienced while installing and integrating the 
system, as well as the solutions we found and lessons we learned from the process. 

Tags: HPC, configuration management, system integration 


1 Introduction 


High Performance Computing (HPC) systems are complex to stand up. They generally involve 
the delivery of a large amount of hardware at one time that must be installed, configured, and 
integrated in a fixed period of time in preparation to be run in relative steady state for five to 
ten years. While physical installation is usually a job for the hardware vendor, configuration 
and integration normally fall on the shoulders of a team of integrators and system administrators 
that will be charged with managing the system after it is put in production. Configuration and 
integration include such complex tasks as configuration management (CM) system design and 
implementation, parallel filesystem and network integration, and accounts system and user services 
integration. Standing up a new HPC resource can be similar to bringing up an entire new data
center from scratch, with all of the same difficulties and pitfalls. 

In this paper we will look at the installation and integration of Cielo, a 96-rack Cray XE6 
compute resource sponsored by the Alliance for Computing at Extreme Scale (ACES) project and 
located at Los Alamos National Laboratory (LANL) and Sandia National Laboratory (Sandia). 
The complete Cielo family consists of four machines: Smog and Muzia, two 1/3-rack test systems; 
Cielito, a 1-rack development system; and Cielo, a 96-rack production system. Muzia is located
at Sandia, while Smog, Cielito, and Cielo are located at LANL. In this paper we will focus on 
Cielo, but all four systems run the same software stack and have similar configurations, and our 
experiences with Cielo align with our experiences on the smaller systems. 


2 System Overview


Cielo is a 96-rack, 142,304-core Cray XE6 system operated by the HPC Division at LANL. Delivery 
of the system began with the arrivals of Smog and Muzia at their respective laboratories in May 
of 2010, followed by delivery of Cielito in June of 2010 and 72 racks of Cielo in August of 2010. 
The final hardware delivery occurred in April of 2011, when Cielo was expanded to its full size of 
96 racks. In this final configuration, Cielo consists of 8518 16-core compute nodes with 32GB of 
memory each; 376 16-core visualization compute nodes with 64GB of memory each; 286 IO nodes; 
16 internal login nodes; and a handful of infrastructure nodes. Cielo debuted on the Top 500 List
of Supercomputer Sites at position six in July of 2011. 

The Cray XE6 is a massively parallel processing system that consists of compute and service 
blades connected by a high-speed torus network. The basic building block of the system is a chassis
that holds eight compute or service blades. Each of these blades contains four diskless nodes that 
are designated as either compute nodes, where actual jobs run, or service nodes, which provide 
management, data virtualization service (DVS), or other services to the machine. Three chassis 
can be placed in a rack, and many racks can be combined in rows to create large systems. All of 
the nodes are connected by Cray’s Gemini network, providing a three-dimensional torus high-speed 
network for user job communication, node management, and filesystem access. 
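
The packaging figures above can be checked against the node counts quoted for the final configuration. The sketch below is arithmetic only; it assumes every one of the 96 racks holds three fully populated chassis, which the text implies but does not state outright, and the reading of the leftover slots as the "handful of infrastructure nodes" is an inference.

```python
# Packaging figures from the system description.
NODES_PER_BLADE = 4
BLADES_PER_CHASSIS = 8
CHASSIS_PER_RACK = 3
RACKS = 96

# Node counts from the final configuration.
COMPUTE_NODES = 8518
VIZ_NODES = 376
IO_NODES = 286
LOGIN_NODES = 16
CORES_PER_COMPUTE_NODE = 16


def node_slots(racks=RACKS):
    """Node slots available, assuming every rack holds three full chassis."""
    return racks * CHASSIS_PER_RACK * BLADES_PER_CHASSIS * NODES_PER_BLADE


def compute_cores():
    """Cores across the standard and visualization compute nodes."""
    return (COMPUTE_NODES + VIZ_NODES) * CORES_PER_COMPUTE_NODE


def unlisted_slots():
    """Slots left after the explicitly counted node types are placed."""
    listed = COMPUTE_NODES + VIZ_NODES + IO_NODES + LOGIN_NODES
    return node_slots() - listed


if __name__ == "__main__":
    print(node_slots(), "node slots")
    print(compute_cores(), "compute cores")
    print(unlisted_slots(), "slots not accounted for by the listed node types")
```

The core count reproduces the 142,304 figure quoted for the system, and the small remainder of slots is consistent with a handful of infrastructure nodes.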

All compute nodes are headless, diskless machines with no external network connections, but 
service nodes have one or two on-board PCI slots for expansion. These slots are most commonly
used to provide outside network connections for login nodes or file server nodes. Two of the service
nodes are designated as special: the boot node and the service database (sdb) node. These two 
nodes have external connections to a RAID device, the bootraid, and are used to provide persistent 
services to the rest of the nodes such as logging, root filesystems, and administrator management 
capabilities. One main task of the boot node is to serve the sharedroot from the bootraid to the 
other service nodes, which mount it as their root filesystem. In our case we also have a set of DVS
nodes that serve this filesystem out to the compute nodes, giving them an optional complete Linux 
environment. 

Outside of the main XE6 racks are two classes of standard rack-mounted machines: the system 
management workstation (SMW) and external login nodes. The SMW is the main administration 
server for the system, controlling low-level hardware access, system bootstrapping, and system 
ramdisks. The external login nodes are larger memory, diskfull analogs to the standard login nodes 
on the system blades that make user access more similar to standard clusters. 


3 Challenges 


Before pieces of Cielo even showed up at the lab, the system administration team realized that we were
going to have challenges integrating a Cray system into our environment. We currently run on the 
order of 20 HPC clusters, ranging from tiny (tens of nodes) up to huge (thousands of nodes), and 
we have been very careful to put configuration management at the forefront when installing new 
systems. All of our existing clusters are fully managed by Cfengine[1], with a collection of dedicated
or multi-cluster CM servers covering every system in every cluster. This has tremendously helped 
our relatively small team keep all of the systems in sync and under control. 

We quickly discovered that Cray has taken a “fully managed appliance” approach with the XE 
system. Part of this is due to the way the system itself is designed: the compute infrastructure 
is tightly coupled by its torus network, with the compute and service node blade hardware fully
managed by their control system. The compute nodes run a stripped down Linux-based operating 
system that is shipped as a Cray-provided image, while the service nodes run a Cray-branded 
release of SuSE’s SLES 11 Linux distribution. Were we running Cielo as a stand-alone system, 
these features would all be fantastic: we could dedicate people to the system to keep it patched and 
running via Cray’s methods, and that would be that. In fact, this appears to be a common way for 
sites to run large Cray systems: Cray will assign one or more software and hardware engineers to a 
site to handle much of the day-to-day management of the system, leaving the system administration 
team with more of a black box system to take care of. 

Due to the existing usage models on our other systems, it was decided that we had a strong 
business case to not take the fully managed appliance route with Cielo. This way we could provide 
users with the most consistent environment across all of our systems while keeping our system 
administration team as flexible as possible. Instead, we needed to have our administrators integrate 
the system into our family of compute resources as tightly as we could. In our case, this was 
definitely the right decision: we’ve spent years keeping a wide variety of systems consistent to 
minimize the learning curve of existing users on new systems, and it has worked very well. However, 
it did present a series of challenges to the system integrators and administrators bringing up the 
systems. The challenges related to bringing up Cielo can be categorized into two broad areas: 
vendor relations and software. 


3.1 Vendor Relations Challenges 


The integration team that brought up Cielo was very fortunate to have a strong working relationship 
with Cray. The Cielo contract included provisions for Cray hardware, software, and application 
engineers to be located on-site at LANL, and through the bringup process we had access to many 
extra resources at Cray as we needed them. However, instead of just using the engineers as managers 
of the appliance, we worked with them to fully integrate the system with our already existing 
infrastructure. This being the first Cray system located at LANL in many years, there were some 
challenges on both sides as we worked out how we could work together most effectively. 

Since we were looking to closely integrate Cielo into our environment, we ended up digging 
deeper into the inner workings of the machine than some of our Cray engineers were used to. In the 
process, we found that there was a lot of in-house knowledge at Cray that, although easy to get, 
was not always obvious to ask about. There were several times that we learned that a standard 
assumption about the management of the machine was incorrect, including such details as the best 
way to reboot nodes, the most effective way to correlate logs between the compute and service 
nodes, and how to update software. One of our running gags became the hypothetical response 
“Oh, you still do it that way?” to any question we asked of Cray. 

On the bright side, we also had good challenges in this area: Cray provided written step-by-step 
procedures for almost everything we needed to do to the system. In many ways we weren't ready for 
this, especially when compared to the many commodity clusters we have in our environment. The 
biggest challenge here was figuring out how to make the best use of the documentation, whether it 
be importing the actions into Cfengine scripts, importing the documentation into our own library, 
or explicitly deciding to follow our own path. Cray was also very receptive to our suggestions for 
improvement. The challenges listed in the next few sections are often clearly related to our desire 
to do something our way instead of Cray’s. However, they were ready to hear our suggestions and 
pass them on to their developers in most cases. There were several instances throughout Cielo’s
bringup where specific nuances we found difficult were changed in later software updates.


3.2 Software Challenges 


Over the course of the year that we have been working with Cielo, the majority of the challenges 
we have faced have come on the software front. As mentioned before, it is very common for Cray to 
assign a group of software engineers to a site with one of their large systems. We believe this practice 
has led to many of our difficulties: Cray wasn’t ready for us to be downloading and working with
many of their software products, and we didn’t have the experience needed to understand some of 
their distribution, packaging, and installation decisions. 





3.2.1 Software Releases 


There are three styles of software updates that we have worked with from Cray: cumulative service 
pack-style updates containing all previous updates for a particular product; individual patches and 
field notices that will eventually be rolled up in a cumulative update; and sliding window updates 
that contain older versions of the related software for compatibility as well as the latest update. 
Cumulative updates are generally released quarterly and contain new functionality, bug fixes, and 
other substantial updates. Individual patches and field notices are released as needed between the 
quarterly updates and generally fix bugs or security issues. Sliding window updates are used for 
the Cray programming environment and include both the latest and the last n releases of
their supported compilers and libraries, where n is determined by Cray’s release engineers based on 
provided functionality and customer usage. Each style of update is packaged differently, but each 
generally consists of a monolithic installer script, one or more directories full of RPMs, and fairly
detailed instructions on how to install the update. 

These individual release styles are further fragmented by individual update idiosyncrasies. Some 
updates are available publicly to all registered Cray users, while some are only available to Cray 
engineers. Most are applied by use of an included monolithic install script, but some are applied 
in a more manual process by following instructions in an install document. Some are applied in 
a way that will be preserved with future updates, while others are applied in a less stable way. 
Finally, versioning can be confusing: many packages include an SVN repository version number in 
their version string that refers to the repository revision from which it was generated. If branching 
happens in an unusual way, this can (and does) lead to newer software having a lower “version 
number” than already-existing software. All of these details are easily absorbed when the system 
is looked at as a standalone appliance, but in a more integrated environment that is used to a high 
level of update automation, they present a challenging hurdle. 
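
The versioning problem can be illustrated with a short sketch. The version strings below are hypothetical, and `version_key` is a deliberately naive numeric comparison standing in for whatever an automated updater does; it is not RPM's own comparison algorithm.

```python
import re


def version_key(version):
    """Integer fields of a version string, for a naive numeric comparison.

    A simplified stand-in for illustration; it is not RPM's own algorithm.
    """
    return tuple(int(part) for part in re.findall(r"\d+", version))


def misordered(releases):
    """Return (older, newer) version pairs where the newer release sorts lower.

    `releases` must be listed in true release order, oldest first.
    """
    pairs = []
    for older, newer in zip(releases, releases[1:]):
        if version_key(newer) < version_key(older):
            pairs.append((older, newer))
    return pairs


if __name__ == "__main__":
    # Hypothetical version strings: release number plus repository revision.
    history = ["1.0.2_r8123", "1.0.2_r7990", "1.0.3_r8200"]
    print(misordered(history))
```

In this made-up history the second package was built later but from a branch at a lower revision, so any tool that upgrades only when the version increases would skip it.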


3.2.2 Software Practices 


Along with acclimating to Cray’s software release methods described above, we ran into challenges 
with the software they contained. One of the biggest difficulties involved abuses of the RPM pack- 
age management system. Most of the original software installs and subsequent updates provided 
by Cray come in the form of RPM packages and monolithic install scripts that examine the hardware
and software currently in use on the system and install required software as needed. This
involves installing most packages with --nodeps and --force options that override RPM’s built-in 
dependency and safety checks. Again, these options fit the appliance model very well: the supplied 
software is vetted by Cray in a specific format, and they want to replicate that format closely in the
field. However, they make verification and integration into an already-existing CM system difficult. 
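
One way to see where an install script bypasses RPM's checks is to scan it before running it. The sketch below is a hypothetical audit helper, not a Cray or Cfengine tool: it flags lines that invoke `rpm` with `--nodeps` or `--force`. The sample script and package name are invented for illustration, and a real script may build its commands dynamically in ways this simple scan would miss.

```python
import os
import shlex

RISKY_FLAGS = {"--nodeps", "--force"}


def risky_rpm_calls(script_text):
    """Return (line_number, flags) for rpm invocations that override safety checks."""
    hits = []
    for lineno, line in enumerate(script_text.splitlines(), start=1):
        try:
            tokens = shlex.split(line, comments=True)
        except ValueError:
            continue  # line with unbalanced quoting; skip rather than guess
        if any(os.path.basename(tok) == "rpm" for tok in tokens):
            flags = RISKY_FLAGS.intersection(tokens)
            if flags:
                hits.append((lineno, sorted(flags)))
    return hits


if __name__ == "__main__":
    # Invented sample script; the package name is a placeholder.
    sample = """#!/bin/sh
rpm -ivh --nodeps --force example-pkg.rpm
rpm -q example-pkg
"""
    for lineno, flags in risky_rpm_calls(sample):
        print(f"line {lineno}: {' '.join(flags)}")
```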

Similarly, Cray’s distributed RPMs often make use of postinstall scripts to take care of large 
portions of the install. This use ranges from fairly benign (creating links if some other packages 
are already installed) to difficult to manage (RPM only contains a .tar.gz file that is unpacked by 
a postinstall script). These “write-only” RPMs also fit into the appliance model, but have many 
problems in a wider-ranging environment: they are difficult to verify, they can seem to behave non- 
deterministically depending on install order, and they tend to require more hand holding than more 
well-behaved packages. We found these packages especially difficult to place under CM control, as 
discussed in the next section. 
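
A rough heuristic for spotting such “write-only” packages is sketched below. It assumes the payload file list and postinstall script text have already been extracted, for example with `rpm -qpl` and `rpm -qp --scripts`; the function name, suffix list, and sample inputs are hypothetical, and the check will miss packages that unpack their payload by other means.

```python
import re

ARCHIVE_SUFFIXES = (".tar.gz", ".tgz", ".tar.bz2", ".tar")


def looks_write_only(payload_files, postinstall_script):
    """Heuristic: the payload is nothing but archives and the script runs tar."""
    if not payload_files:
        return False
    only_archives = all(name.endswith(ARCHIVE_SUFFIXES) for name in payload_files)
    runs_tar = re.search(r"\btar\b", postinstall_script) is not None
    return only_archives and runs_tar


if __name__ == "__main__":
    # Invented examples of the two extremes described in the text.
    print(looks_write_only(["/opt/example/payload.tar.gz"],
                           "tar -xzf /opt/example/payload.tar.gz -C /opt/example"))
    print(looks_write_only(["/opt/example/bin/tool"],
                           "ln -sf /opt/example/bin/tool /usr/bin/tool"))
```

Packages flagged this way are the ones whose installed files RPM cannot verify, since the database only knows about the archive.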

Finally, we ran into several inconsistencies with respect to how software versions were managed. 
The modules[2] package and /etc/alternatives system[3] are two existing packages created to
ease the selection of and switch between multiple versions of equivalent software installed on a 
system. The Cray software stack uses both of these packages to manage its software versions, even 
using both on the same package in some cases. In most cases they also use a third method involving 
“default links”: symbolic links in each package’s install path that point to the install root of the 
version that should be treated as default. These different methods are used to varying degrees by
different pieces of the overall system: the modules are mostly used for normal users of the system
to choose compilers, libraries, and related packages; while the alternatives and default links are 
mostly used by system-level processes. However, there is overlap between the usage of each. 


3.2.3 Configuration Management 


We discovered many of the challenges listed above while working to put Cielo under complete con- 
figuration management control. LANL’s HPC division has traditionally used Cfengine to manage 
its systems, relying on it to create a uniform management environment across all of its clusters. 
Early on we recognized that this would be more difficult on Cielo than on traditional clusters, 
but we knew that it was the best way to integrate the new machine into our group’s management 
rotation. 

The software practices mentioned in the previous section all made configuration management 
challenging: creative uses of RPM, inconsistent versioning, and multiple management methods all 
add layers of complexity to the CM problem. However, the biggest challenge of all was Cray’s use 
of monolithic install scripts. While very handy when installing software and updates interactively, 
these scripts made it difficult to automate configuration of the machine. Some of these scripts 
were easy to analyze and either import into Cfengine or call directly during the install process. 
Others were much more problematic: one explicitly checked to make sure it was connected to a 
TTY and exited with an error if it wasn’t, while another stopped in the middle and presented a set 
of commands for the administrator to run in a second window before telling the script waiting in 
the first to continue. In another case, the script didn’t even trust itself - after running, it instructs 
the administrator to check its work and confirm it had written out various files correctly. It turns 
out this was a needed step each time we ran it. 


4 Responses 


The above sections may sound like a large doom-and-gloom scenario, but in the end we were able 
to integrate Cielo into our environment in a way that is relatively easy for our administrators to 
pick up quickly. While we could have run the machine as an appliance-like system and not needed 
to deal with most of the challenges we discussed, we concluded that closer integration with our 
existing systems would help our small system administration team take on the system quickly after 
integration was complete. That made each of the challenges into actual problems and pushed us 
to find solutions for them. 


4.1 Working Together 


One of the most important things we did was keep a strong working relationship with Cray, our 
vendor. While it would be very easy for the clashes between our ideal system and their real-world 
products to result in deep fighting and animosity between the two groups, we were all able to keep 
a good relationship. Cray was eager to hear our concerns, fix problems, and submit idea cases 
when appropriate, and we were open to understanding their reasoning behind the design choices 
they made. We believe this is an important thing for both vendor and customer to keep in mind 
during a machine standup - both sides need each other, and keeping a good relationship is very 
beneficial in the long run. 

Along with working closely with our vendor, we were careful to work closely across teams at 
LANL and Sandia. On the systems side, the Cielo bringup was a collaboration between HPC-5 
(the system integrators) and HPC-3 (the system administrators) at LANL and the Cray support 
team at Sandia. Having these three teams work together gave us great power: we had the Cray 
experience from Sandia, the new system integration experience from HPC-5, and the long-term 
production system experience from HPC-3. While the nature of our environment made in-person 
collaboration the most useful form, we also made use of standard conference calls and email lists to 
keep each other in sync. The HPC-5/HPC-3 collaboration was especially important, as it made the 
transition from integration to production smooth and much less painful than a “throw it over the 
fence” model would have provided. Being able to work closely between all of the groups without 
chain of command overhead made it easy for us to make quick progress with the project. 





4.2 Configuration Management 


Another very important decision we made was to use a CM tool from the beginning. Although 
this could be seen as the source of several of our challenges, we would have had a much larger 
set of more difficult challenges without it. With Cielo, we ended up using a layered approach to 
managing the various parts of the system. Since the majority of the cluster is diskless, our final 
CM scheme had a small number of nodes that actually ran the Cfengine client: the SMW, the boot 
node, and the external login nodes. Everything else was managed by the sharedroot area (from the 
boot node) or the ramdisk images (from the SMW). With this design, we effectively had only a 
handful of Cfengine product areas to manage. This simplification made it easier to quickly grasp 
the design of the system and push out changes to the large number of nodes in the system. 

Of course, after putting our CM system in place we still had a number of management tasks 
that required manual work. Most of these revolved around the monolithic install scripts mentioned 
previously - some of these were impossible to automate, while others just weren’t worth the time. 
For these we decided the best route was to document the exception and train new system 
administrators to recognize when they needed to do things by hand. In some cases, such as rebuilding the 
compute image ramdisk, we were able to have Cfengine print out a message after a successful run 
telling the administrator what more needed to be done by hand. In other cases we needed to rely 
on the carefully-maintained documentation wiki that the LANL administrators already use. By 
modeling the Cielo documentation off of existing documentation for other systems, we were able 
to fit these manual processes into the mindset already known by the system administration team. 

By implementing a complete configuration management scheme from the beginning, we were 
able to make several big changes to the system relatively quickly and painlessly. The first happened 
when we swung Cielo from our open network to our classified network: this required rebuilding the 
entire machine from scratch, which we were able to do in a matter of days using the configuration 
management system and documentation we had created. Later, we were able to quickly rebuild the 
system again when the upgrade from 72 to 96 racks required a large change in machine topology. 
Immediately after that, we made a quick upgrade between two Cray service packs that had caused 
problems at other sites with little trouble, mostly because we had a fully managed system and 
could recognize which system components had changed in incompatible ways. In short, our early 
effort has repaid itself several times over already. 


4.3 Homegrown Tools 


While bringing up Cielo, we found several system management needs that weren’t quite met 
by existing Cray management tools, but were too specialized to our environment to submit as 
cases to Cray. Instead we wrote our own tools to fill in the gaps. 


4.3.1 xtautorpm 


Most of Cielo’s compute and service nodes are diskless systems that use a ramdisk and an 
NFS-mounted read-only root filesystem to provide their operating system environment. The NFS 
filesystem provides system specialization of files through a layered approach, with the base filesystem 
being overlaid by views of node-specific files. These systems are managed by an interactive Cray 
tool named xtopview that handles package installation, file specialization, and other management 
tasks by presenting the administrator with a chrooted environment corresponding to the specialized 
view of each system or class of systems. This extra layer made our team’s standard management 
methods difficult, as it is designed to be run interactively, only one person can run the xtopview 
utility at a time, and the utility has no provision for using tools such as yum to install packages. 

To alleviate these restrictions, we expanded one of our already-existing tools under the name 
xtautorpm. This new tool acts as a layer between Cfengine and the rpm command, giving 
Cfengine the abstraction needed to use xtopview directly. With this extra layer of abstraction, we 
made the package installation procedure identical to that on our other clusters without losing the 
support of the vendor supplied tools. 


4.3.2 xtfixdefault 


As mentioned earlier, Cray uses several software version management schemes on their systems. We 
found it time consuming to manually manage both the modules environment (which we understood 
well) and the “default links” system that Cray introduced. To prevent version skew, we wrote a 
tool named xtfixdefault to keep the two systems synchronized. Since we were already familiar 
with the modules system, we decided to use it as the base for our versioning. When Cfengine runs 
xtfixdefault, the utility checks all of Cray’s default links and confirms that they are pointing to 
the same software versions as the modules environment’s default version. When run interactively, 
the tool can also be used to update the modulefile from the default link and report on which 
modulefiles and default links are not the same. With this one utility we can both enforce our 
will over the software versions with Cfengine and report changes performed by Cray’s monolithic 
software installers. This utility has made software updates much less time consuming. 
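
The consistency check described above can be sketched roughly as follows. This is a minimal illustration, not the actual xtfixdefault: it assumes each package's modulefile directory names its default in a `.version` file using the usual Environment Modules `set ModulesVersion` line, and that the Cray-style default link is a symbolic link literally named `default` in the package's install path. The function names and directory layout are hypothetical.

```python
import os
import re


def module_default(modulefile_dir):
    """Return the default version named in <modulefile_dir>/.version, or None."""
    path = os.path.join(modulefile_dir, ".version")
    try:
        with open(path) as fh:
            text = fh.read()
    except FileNotFoundError:
        return None
    match = re.search(r'set\s+ModulesVersion\s+"?([^"\s]+)"?', text)
    return match.group(1) if match else None


def link_default(install_dir):
    """Return the version that <install_dir>/default points to, or None.

    The link name "default" is an assumption made for this sketch.
    """
    link = os.path.join(install_dir, "default")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.path.normpath(os.readlink(link)))


def find_skew(packages):
    """Report packages whose two default mechanisms disagree.

    packages maps a package name to (modulefile_dir, install_dir).
    Returns {name: (modules_default, link_default)} for each mismatch.
    """
    skew = {}
    for name, (modulefile_dir, install_dir) in packages.items():
        from_modules = module_default(modulefile_dir)
        from_link = link_default(install_dir)
        if from_modules != from_link:
            skew[name] = (from_modules, from_link)
    return skew
```

A report-only pass like `find_skew` corresponds to the interactive reporting mode; an enforcing pass would additionally rewrite the link to match the modules default, which is the direction the text says Cfengine enforces.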


4.3.3 ethcfg 


The file specialization provided by the xtopview command is generally used at a class level to 
cover a large group of service nodes at once or at the individual node level to make one node stand 
out from the others. Both of these cases are simple and straightforward to manage. However, 
there is one specialization case that requires every node to have its own file: the static network 
management files. Standard configuration management systems avoid the need for hundreds of 
node-specific files by using templates, DHCP, or other similar solutions, but these did not fit the 
Cray model well. Instead we wrote a simple-but-powerful init script dubbed ethcfg that configures 
the nodes’ network interfaces by reading a flat configuration file at boot time. This file contains the 
network interface information for all of the service nodes in the system, meaning it can live in the 
default overlay view and requires no specialization for each node. The number of nodes included 
in the file is small enough that we found no performance problems with a flat file, giving us ease of 
maintainability over a more complex system using something like SQLite. 
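
To illustrate the flat-file idea, here is a small sketch of how one shared file can serve every node. The file format (four whitespace-separated columns: hostname, interface, address, netmask), the node names, and the function names are all assumptions made for this example; the report does not describe the real ethcfg format. The sketch only builds `ip addr add` command lines and does not run them.

```python
import ipaddress

# Hypothetical flat file: one line per interface, shared by all nodes.
SAMPLE = """\
# hostname  interface  address      netmask
nid00004    eth0       10.0.0.4     255.255.255.0
nid00008    eth0       10.0.0.8     255.255.255.0
nid00008    eth1       192.168.1.8  255.255.255.0
"""


def interfaces_for(host, text):
    """Return [(interface, address, netmask)] for one host from the flat file."""
    result = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"malformed line: {raw!r}")
        name, iface, addr, mask = fields
        if name == host:
            result.append((iface, addr, mask))
    return result


def commands_for(host, text):
    """Build the iproute2 commands an init script could run for this host."""
    commands = []
    for iface, addr, mask in interfaces_for(host, text):
        cidr = ipaddress.ip_interface(f"{addr}/{mask}").with_prefixlen
        commands.append(["ip", "addr", "add", cidr, "dev", iface])
    return commands
```

Because every node reads the same file and selects its own lines by hostname, the file needs no per-node specialization, which is the property the section relies on.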


5 Lessons 


After bringing up Cielo, we were able to put together a few lessons we learned along the way. 


Keep good relations with your vendors : It is all too easy for vendor relations to break down 
when you don’t see eye to eye with them. Keeping a good relationship makes it much easier 
to keep all sides progressing throughout the project. 


Get test systems early : Although they were only mentioned briefly at the beginning of this 
report, our three smaller systems (Smog, Muzia, and Cielito) were instrumental in getting us 
experience with Cray’s way of doing things early. When building a new system, getting access 
to representative hardware early in the process fills the knowledge pipeline much faster. 


Use configuration management, even if it takes effort : The upfront cost of configuration 
management is easier to see than the long-term gains, but those gains are real. Whether you 
should work to fit a system into an existing CM scheme or not is a site-specific question, but 
using some tool is the best choice in any complex case. 


When standing up a system of a new design, plan for “Murphy Time” : Murphy’s Law 
will assert itself as often as it can, especially with new systems. Be ready for that. Finishing 
early is much more impressive than finishing late. 


Work as a team : Today’s systems are too complex for one person to fully understand. There 
are too many pieces: hardware, software, networking, filesystems, system management, and 
the list goes on. Working as a team is important for sharing responsibilities and areas of 
expertise with a new system and for keeping everybody interested in the project. Resist the 
urge to designate “the guy that knows it all”, as he will inevitably win the lottery and leave 
the group. 


6 Conclusions 


As we stated in the beginning, standing up a new HPC resource is a complex task. While integrating 
Cielo, we ran into an expected breadth of challenges: managing the vendor/customer relationship, 
integrating an appliance-like system into an already-existing environment, designing 
a configuration management system around an imperfect software distribution design, and other 
more minor challenges. In our case these were all framed within the desire to make the system 
behave similarly to an already extensive set of HPC resources, a requirement from both the user 
and administrator points of view. 

We were able to respond to these challenges with a combination of technical and social solutions 
involving, among more minor solutions, a close working relationship between our vendor and local 
teams, using strong configuration management and careful documentation when appropriate, and 
writing custom tools to fill in gaps as needed. The combination of solutions we found kept us flexible 
enough to make good decisions each time while getting the work done in the needed timeframe. 

In the end, we were able to put together a short list of lessons that we thought were important 
from our experience. On the top of that list was the need to keep strong working relationships with 
all of the groups involved. Closely following this was the need for configuration management from 
the beginning. The list was rounded out with other lessons that are obvious in hindsight, but easy 
to lose track of in the heat of getting work done. 

The final result of the work described in this report is a very manageable system. Like all systems 
of Cielo’s complexity, there will always be work to be done, but we have a strong foundation on 
which to continue building and we are confident in the work we have done to integrate it into our 
environment. 


Abstract 


Managing storage growth is painful [1]. When a 
system exhausts available storage, it is not only an 
operational inconvenience but also a budgeting 
nightmare. Many system administrators already have 
historical data for their systems and thus can predict 
full capacity events in advance. 


EMC has developed a capacity forecasting tool for 
Data Domain systems which has been in production 
since January 2011. This tool analyses historical data 
from over 10,000 back-up systems daily, forecasts 
the future date for full capacity, and sends proactive 
notifications. This paper describes the architecture of 
the tool, the predictive model it employs, and the 
results of the implementation. 


1 Introduction 


Data storage utilization is continually increasing, 
causing the proliferation of storage systems in data 
centers. Monitoring and managing these systems 
requires increasing amounts of human resources, and 
therefore automated tools have become a necessity. 


IT organizations often operate reactively, taking 
action only when systems reach capacity, at which 
point performance degradation or failure has already 
occurred. Instead, what is needed is a proactive tool 
that predicts the date of full capacity and provides 
advance notification. 


Predictive modeling has been applied to many fields: 
forecasting traffic jams [2, 3], anticipating electrical 
power consumption [4], and projecting the efficacy 
of pharmaceutical drugs [5]. Within the IT field, 
capacity management of server pools has been 
studied [6]. Ironically, there seems to be little 
previous work discussing applications of predictive 
modeling to data storage environments. 


During the past year a predictive model has been 
employed internally at EMC to forecast system 
capacity and generate alert notifications months 
before systems reach full capacity. The ultimate 
purpose of this tool is to provide customers with both 
time and information to make better decisions 
managing their storage environment. 


2 Data Collection 


Data Domain systems are backup servers that employ 
inline deduplication technology on disk. All Data 
Domain back-up storage devices have a “phone- 
home” feature called Autosupport. Customers can 
configure their Data Domain systems to send an 
email every day with detailed diagnostic information. 
In addition, they can send email when specific events 
are encountered by the operating system. Once these 
emails are received at EMC, they are parsed and 
stored in a database. 


Sending of diagnostic data via email to EMC is 
voluntary by the customer. Often, in secure 
environments, customers choose to disable the 
feature. In order to monitor their systems, customers 
have the ability to configure the autosupport emails 
to be delivered to internal recipients. 


Most customers choose to send autosupports to EMC 
because the historical data enables more effective 
customer support. Given the more than 10,000 
autosupports received daily, EMC has a statistically 
significant view across the Data Domain install base. 


For the purpose of capacity forecasting, two variables 
are required at each point in time: 


1. Total physical capacity of the system 
2. Total physical space used by the system 


For Data Domain systems, the total physical capacity 
changes over time because they generate an index 
which slightly decreases the amount of physical 
capacity available for data storage. 


3 Data Cleaning 


In order to ensure data integrity there are two issues 
to be addressed: data artifacts and elimination of non- 
production data. 


Data Artifacts: In order to prevent bad data from 
entering the analysis, the tool assesses the quality of 
every autosupport and applies rules to guarantee 
consistency. These artifacts may arise due to an error 
in parsing the autosupport, data corruption during the 
transport of the autosupport, or both. 


Non-Production Data: All internal Data Domain lab 
systems and QA systems send autosupports which are 
parsed and loaded into the database. These systems 
may be under development and therefore their 
performance characteristics may vary dramatically 
from production systems being used in the field. 
While this data is of value to internal teams, for the 
purposes of capacity forecasting, the data from these 
systems is excluded. 
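
The two cleaning steps can be sketched as a simple filter over daily samples. The specific plausibility rules below (capacity must be positive, and used space must lie between zero and capacity) are assumptions chosen for illustration; the paper does not list the tool's actual rules. The record layout and the names `Sample`, `clean`, and `utilization` are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    system_id: str
    day: int            # days relative to today; 0 is the most recent
    capacity_gb: float  # total physical capacity of the system
    used_gb: float      # total physical space used by the system


def is_plausible(sample):
    """Reject obvious data artifacts (assumed rules, not the tool's own)."""
    return sample.capacity_gb > 0 and 0 <= sample.used_gb <= sample.capacity_gb


def clean(samples, non_production_ids):
    """Drop non-production systems and implausible samples."""
    return [
        s for s in samples
        if s.system_id not in non_production_ids and is_plausible(s)
    ]


def utilization(sample):
    """Fraction of physical capacity in use, the quantity that gets forecast."""
    return sample.used_gb / sample.capacity_gb
```

Dividing by the capacity reported on the same day keeps the utilization consistent even though total physical capacity drifts over time.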


4 Predictive Model 


One of the most common methods employed in 
predictive modeling is linear regression. 
Unfortunately, application of regression to storage 
capacity time series data is challenging because 
behavior changes. System administrators may add 
more shelves to increase capacity, change retention 
policies, or simply delete data. Therefore blind 
application of regression to the entire data set often 
leads to poor predictions. 


Figure 1: Example capacity data for the prior 100 
days. (Time = 0 is the most recent data.) The 
standard deviation is 6 throughout the data. The blue 
line shows the result of applying linear regression to 
the entire data set. 


The predictions of the linear regression in Figure 1 
are very poor. Intuitively, the data indicates the 
system is going to reach 100% capacity within a few 
days, but the regression line predicts far later (a false 
negative). 


Select a Subset of Recent Data 


The simplest method to mitigate the issue illustrated 
in Figure 1 would be to choose a subset of recent data 
such as the prior 30 days. This eliminates the 
influence of older data and improves the accuracy of 
the model’s predictions. Unfortunately, using a fixed 
subset to model all systems results in poor linear 
models for many systems. Significantly more 
accurate models can be obtained by finding the 
optimal subset of data for each system and applying 
linear regression to only that subset of the data. 


4.1 Piecewise Linear Regression 


The error rate of the linear regression model can be 
significantly reduced by applying the regression to a 
data subset that best represents the most recent 
behavior. This requires implementing piecewise 
linear regression [7]. 


In order to find the best subset of data, the boundary 
must be determined where the recent behavior begins 
to deviate. The method described here analyses the 
quality of many linear regressions and then selects 
the one having the best fit. 


The goodness-of-fit of a linear regression to 
experimental data can be measured by evaluating the 
coefficient of determination R². It is defined as the 
regression sum of squares (“SSM”) divided by the 
total sum of squares (“SST”) [8]. 


$$R^2 = \frac{SSM}{SST} = \frac{\sum_i \left[f(x_i) - \bar{y}\right]^2}{\sum_i \left[y_i - \bar{y}\right]^2}$$

Properties of R² 
• 0 ≤ R² ≤ 1 
• R² = 1 indicates perfectly linear data. 

Calculating the Boundary: Start with a small 
subset of the data, such as the prior 10 days, and then 
apply regression to incrementally larger subsets to 
find the regression having the maximum value of R². 


1. Regress {(x_{-10}, y_{-10}), (x_{-9}, y_{-9}), ..., (x_0, y_0)} 
2. Calculate R² for regression 
3. Regress {(x_{-11}, y_{-11}), (x_{-10}, y_{-10}), ..., (x_0, y_0)} 
4. Calculate R² for regression 
5. ... 
6. Regress {(x_{-n}, y_{-n}), (x_{-n+1}, y_{-n+1}), ..., (x_0, y_0)} 
7. Calculate R² for regression 
8. Select the subset with maximum R² 


The boundary is the oldest data point within the 
subset of data determined in step 8. The predictive 
model is generated by applying linear regression to 
that subset. 


Figure 2: The same data used in Figure 1 with R² 
plotted for each subset of data. The date when R² 
reaches its maximum value is the “calculated 
boundary” and occurs near the discontinuity of the 
true function. Maximum R² = 0.95 at -48 days and 
the true boundary is -40 days. 


Figure 3: The same data from Figures 1 & 2. 
Piecewise linear regression results in a better fit to 
the data. This model was generated using the subset 
{(x_{-48}, y_{-48}), ..., (x_0, y_0)} determined by the boundary. 
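
The boundary search can be sketched in a few lines of Python. This is one possible reading of the procedure, not the production tool: it assumes the data is sorted oldest to newest with one point per day, grows the window backward from the newest point, and keeps the window with the largest R². The names `fit` and `piecewise_fit` and the `min_points` default of 11 (the points at days -10 through 0) are choices made for this example.

```python
def fit(xs, ys):
    """Least-squares fit y = a + b*x; returns (a, b, r2) with r2 = SSM / SST."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    sst = sum((y - mean_y) ** 2 for y in ys)
    ssm = sum((a + b * x - mean_y) ** 2 for x in xs)
    r2 = ssm / sst if sst > 0 else 0.0  # assumed convention for flat data
    return a, b, r2


def piecewise_fit(xs, ys, min_points=11):
    """Grow the window back from the newest point and keep the best R^2.

    xs must be sorted oldest to newest (for example -99 ... 0).
    Returns (boundary_x, a, b, r2) for the winning subset.
    """
    if len(xs) < min_points:
        raise ValueError("not enough data points")
    best = None
    for size in range(min_points, len(xs) + 1):
        window_x, window_y = xs[-size:], ys[-size:]
        a, b, r2 = fit(window_x, window_y)
        if best is None or r2 > best[3]:
            best = (window_x[0], a, b, r2)
    return best
```

Each candidate window is refit from scratch, which costs O(n²) work per system; for a hundred or so daily points that is negligible, and running sums could reduce it if needed.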


Preprocessing data by applying a smoothing function 
can increase R², but has limitations. Filtering out 
noise while maintaining the signal is easier said than 
done. Too much smoothing and it becomes too 
difficult to determine the boundary point. 
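
A trailing moving average is one simple smoothing function that shows the trade-off. The paper does not say which smoothing function was considered, so the choice of a moving average and the name `moving_average` are assumptions for illustration only.

```python
def moving_average(ys, window):
    """Trailing moving average over the last `window` points.

    The first window-1 outputs average only the points seen so far.
    """
    out = []
    total = 0.0
    for i, y in enumerate(ys):
        total += y
        if i >= window:
            total -= ys[i - window]  # drop the point that left the window
        out.append(total / min(i + 1, window))
    return out


# A sharp step in the raw data ...
step = [0, 0, 0, 1, 1, 1]
# ... is spread across several days after smoothing with window=3,
# which is what makes the boundary harder to locate.
smoothed = moving_average(step, 3)
```

With a window of 3, the single-day jump in `step` becomes a three-day ramp, so the larger the window, the more the true change point is blurred.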


4.2 Other Models 


Many other models can be applied to time series data, 
such as weighted linear regression, logarithmic 
regression, and auto-regressive (AR) models. In the 
current implementation, a simple linear model has 
been shown to effectively model many systems (see 
Section 5: “Results of Predictive Modeling”). It is an 
open question whether the remaining systems can be 
modeled by other methods. 


4.3 Model Validation 


The model needs to be able to say, “I don’t know.” 
Sometimes there is no pattern in the data. Before 
employing a model to predict future behavior, it 
should be evaluated to determine if it is a reasonable 
model for the data set. In the current implementation, 
validation rules are applied to the results of the linear 
model to determine if capacity forecasts should be 
published. 


Goodness-of-fit: When the R² value from piecewise 
linear regression is too small, it indicates the model is 
a poor fit to the data. In the current tool, linear 
regression models with R² < 0.90 are not used. 


Positive Slope: Linear models having a zero or 
negative slope cannot be used to predict the date of 
100% full. 


Timeframe: Forecasts for systems to reach full 
capacity far into the future are extrapolating the 
current behavior too much to be practical. The 
current model limits forecasts to less than 10 years. 
The expectation is that within 10 years the storage 
technology will be significantly different than it is 
today. 


Sufficient Statistics: Storage systems that have been 
recently deployed lack enough historical data to 
produce statistically sufficient regression models. A 
minimum of 15 days of data is a reasonable threshold 
for the size of the data set. 


Choosing a smaller minimum value may result in 
fitting the model to noise. Linear regression can 
achieve a very good fit to a handful of data points, 
but the results are not statistically significant. 


Space Utilization: Experience has shown that 
systems which are less than ~10% full tend not to 
produce reliable predictions. For this situation, the 
current tool does not generate capacity forecasts. 


Last Data Point Trumps All: Recent changes in 
system capacity must be taken into account to 
evaluate the linear fit. When systems are nearing 
maximum storage capacity, the administrator often 
takes action which results in drastic changes in the 
amount of available capacity. If the administrator 
reduces the amount of data stored on a device, the 
capacity prediction of the model is no longer valid. 
Assessing this error is a simple form of cross- 
validation. 


Figure 4: System capacity dropped from 100% to 
50%. The generated linear model has a high 
goodness-of-fit (R² = 0.89) but the prediction for the 
most recent data point has 35% error. The 
model predicts the system is 85% full at Time=0, but 
it is only 50% full. 


If the error between the predicted value and the actual 
value of the most recent data point exceeds 5%, it is a 
good indication that the recent data diverges 
significantly from the model and therefore the model 
is no longer valid. 
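
Taken together, the validation rules of this section can be sketched as a single check that returns the names of the rules a model fails. The thresholds come from the text; the function name, the treatment of 10 years as 3650 days, and the assumption that x is measured in days with x = 0 at the most recent data point (so the model's prediction for that point is simply the intercept) are choices made for this sketch.

```python
MIN_R2 = 0.90                # goodness-of-fit threshold
MAX_HORIZON_DAYS = 3650      # "less than 10 years", taken as 3650 days
MIN_POINTS = 15              # minimum days of data
MIN_UTILIZATION = 0.10       # systems under ~10% full are skipped
MAX_LAST_POINT_ERROR = 0.05  # allowed error on the most recent point


def validation_failures(intercept, slope, r2, n_points, last_actual):
    """Return the names of failed rules; an empty list means 'publish'.

    Model: y = intercept + slope * x, with y as a fraction of capacity.
    last_actual is the observed utilization at x = 0.
    """
    failures = []
    if r2 < MIN_R2:
        failures.append("goodness-of-fit")
    if slope <= 0:
        failures.append("positive slope")
    elif (1.0 - intercept) / slope >= MAX_HORIZON_DAYS:
        failures.append("timeframe")
    if n_points < MIN_POINTS:
        failures.append("sufficient statistics")
    if last_actual < MIN_UTILIZATION:
        failures.append("space utilization")
    if abs(intercept - last_actual) > MAX_LAST_POINT_ERROR:
        failures.append("last data point")
    return failures
```

Applied to the situation in Figure 4 (prediction of 85% against an actual 50%, with R² = 0.89), this check reports both the goodness-of-fit and the last-data-point rules. Returning the list of failed rules rather than a bare boolean makes it easier to see why a model said "I don't know."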


5 Results of Predictive Modeling 


5.1 Analysis of Linear Regression Fit to 
Past Data 


If historical data does not demonstrate linear growth, 
then obviously linear regression would be a poor 
model to employ. To investigate this issue, the 
piecewise linear regression algorithm described in 
section 4.1 was applied to the historical dataset from 
Data Domain storage appliances and the maximum 
R² was calculated for each system. 


Figure 5: Histogram of R² across all systems using a 
minimum 15 days of data. This illustrates that most 
of the regression models generated for storage 
systems have R² close to 1.0. 


Summary of results: 

1. The median R² for all systems was 0.93 
2. Models for 60% of systems had R² > 0.90 
3. Models for 78% of systems had R² > 0.80 


These results indicate that the majority of systems 
exhibit very linear behavior since the linear model 
had a very good fit to the datasets. 


5.2 Forecasting Full Capacity 


After the model is generated from historical data, the 
next step is to apply the validation rules described in 
section 4.3. For models that pass validation, the final 
step is to solve for the future date the system will 
become 100% full. The linear model: 


$$y = \alpha + \beta x$$


Definitions: 
• y is capacity 
• α is the intercept term 
• β is the slope 
• x is the date 


Assuming the slope is positive (β > 0), the future date 
for the system reaching full capacity can be 
calculated by setting the capacity y = 1 (100%) and 
solving for x: 


Forecast Full Date: $$x = \frac{1 - \alpha}{\beta}$$
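
A short worked example of this calculation, assuming x is measured in days relative to the most recent data point (x = 0) and capacity is expressed as a fraction between 0 and 1. The function names are illustrative.

```python
from datetime import date, timedelta


def days_until_full(alpha, beta):
    """Solve 1 = alpha + beta * x for x (days from the latest data point)."""
    if beta <= 0:
        raise ValueError("slope must be positive to forecast a full date")
    return (1.0 - alpha) / beta


def forecast_full_date(alpha, beta, today):
    """Calendar date on which the linear model reaches 100% capacity."""
    # Adding a timedelta to a date uses whole days only.
    return today + timedelta(days=days_until_full(alpha, beta))
```

For instance, a system that is 70% full today (α = 0.70) and growing by 0.2% of capacity per day (β = 0.002) is forecast to be full in (1 − 0.70) / 0.002 = 150 days.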


5.3 Analysis of the Quality of Forecasts 


False Positives 


False positives frequently originate from unforeseen 
future human activities which cannot be predicted by 
the model. It is difficult to construe such false 
positives as flaws in the model per se given that the 
only input provided to the model is historic behavior 
of the system. 


When a system is on a linear trajectory to full 
capacity but never reaches 100% full, it may be 
due to external or internal events. An external event 
may originate from a significant change in the 
amount or rate of data placed into primary storage. 
An event internal to the system may be caused by the 
system administrator taking action to implement 
configuration changes. These can include: 


1. Hardware changes 
a. The system was entirely replaced 
b. A shelf was added, increasing capacity 
c. Internal disk drives were replaced 
2. Software changes 
a. Retention policy was changed 
b. Data was deleted and/or moved 


A specific example may help elucidate the issues 
concerning false positive capacity forecasts. Even 
with visual inspection of the data by a human, it is 
extremely difficult to assess a false positive a priori. 


Figure 6: System exhibiting several changes in the 
rate of storage utilization. At this point in time the 
regression may be a false positive. 


Visual inspection of the storage capacity of the 
system in Figure 6 indicates that storage utilization is 
on a trajectory to reach full capacity in September. 
However, the most recent data in August might be an 
early indication that the trajectory is changing. This 
recent data may imply the system is stabilizing near 
60% of capacity, but at this point in time there is 
insufficient data to establish a new trajectory. 


From a statistical perspective, it is unknown whether 
the recent data points are signal or noise. This 
illustrates how allowing the use of small data sets has 
the risk of fitting the model to noise. 


Ironically, in spite of the intuitive uncertainty, the fit 
to data is very good: R² = 0.90 and the prediction 
error is only 4.5% on the most recent data point. This 
example is potentially a good candidate for the model 
to fail validation and report, “I don’t know.” There is 
a trade-off between eliminating reasonable models 
versus generating false positives. By requiring more 
data for models, we gain higher confidence in their 
predictions, but reduce the advance notification for 
true positives. 


Figure 7: Same system shown in Figure 6 with 
additional data points. 


After a few more days, the piecewise regression 
model fits the recent behavior of the system in Figure 
7. Only after obtaining more data can we determine 
that the model in Figure 6 was a false positive. It is 
often the case that false positives can only be 
observed with the benefit of hindsight (additional data). 


No Forecast for Full Capacity 


When a model fails validation (described in section 
4.3) no forecast should be made. On a typical day the 
current model does not publish forecasts for 
approximately 40% to 50% of all systems. This is 
not a surprising result. Most systems are expected to 
be efficiently managed by their administrators. The 
model is only considered valid for systems which are 
on a trajectory to full capacity in the future. 


It is an operational decision to determine the quantity 
of forecasts to be published. The percent of systems 
for which forecasts are published can be easily 
adjusted by tailoring the validation rules for each 
environment. 


USENIX Association 


5.4 Analysis of Forecasts across Install Base 


Application of the model described to the entire 
install base results in a number of observations. 


[Figure 8 plot: Histogram of forecasted days to 100% capacity. Horizontal axis: Days to 100% Full (0 to 2000). Vertical axis: Percent of Total.] 


Figure 8: Histogram of forecasts for systems to reach 
full capacity. The median time to 100% full is 197 
days. Therefore, for systems with valid models, the 
forecast is that half of them will reach full capacity within 
approximately six months. 
Figure 9: Greater detail (6 months) of the data used 
in Figure 8. 


Given the peak values of these histograms, a majority 
of the systems are predicted by the model to reach 
full capacity in the near future. There are at least two 
conjectures that may explain the patterns of Figures 8 
and 9: 


Hypothesis 1: Efficient use of capital: Since the 
cost of storage (dollars per GB) drops quickly over 
time, the majority of storage devices are intended to 
only have enough space for the near future. It’s 
cheaper to delay the purchase of additional storage 
until it’s absolutely needed. 


Hypothesis 2: Capacity Exceeded Expectations: 
System administrators forecasted their capacity needs 
for the long-term, but they underestimated the rate of 
growth. 


6 Capacity Forecasting Examples 


The application of capacity forecasting may be 
illustrated by examining a few examples of 
production Data Domain storage systems. 
Figure 10: System exhibiting linear segments. 


This type of behavior was the motivation for 
developing the piecewise linear regression algorithm. 
The data prior to May is useless for prediction since 
it is significantly different from the current behavior of 
the system. Application of piecewise linear 
regression correctly found a model that fits the data 
from the beginning of June to the last data point. 
Figure 11: A behavioral change in the rate of storage 
utilization occurred at the end of May, but the 
piecewise linear regression model correctly fit the 
most recent behavior despite noisy data. 
Figure 12: Shelf was added to an existing Data 
Domain system. 


The total capacity exhibits a discontinuity in May. In 
this figure, the system reached full capacity and then 
a shelf was added. The model fits the recent data and 
predicts the system will reach 100% capacity in 
approximately three months. 
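
Once a linear segment has been fitted to the recent data, the forecast itself is simple arithmetic. The sketch below assumes utilization is expressed as a fraction of total capacity and time in days; the function name and the example slope and intercept are hypothetical, not values read from the figures.

```python
def days_until_full(slope, intercept, today, full=1.0):
    """Days from `today` until the fitted line reaches `full` utilization.

    slope is utilization gained per day and intercept the utilization at
    day 0, both taken from the most recent linear segment. Returns None
    when the trend is flat or decreasing, since no forecast applies.
    """
    if slope <= 0:
        return None
    day_full = (full - intercept) / slope
    return max(0.0, day_full - today)


# Example: 50% full at day 0, growing 0.2% of capacity per day, evaluated
# on day 160. The line reaches 100% on day 250, about three months later.
forecast = days_until_full(slope=0.002, intercept=0.5, today=160)
```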
7 Conclusions and Future Work 


The role of automated predictive modeling for 
managing IT systems will become more pervasive as 
the complexity and size of data centers continue to 
grow. [9] 


This paper describes a model that uses historical data 
to predict when Data Domain systems will reach full 
capacity. Advance notice of storage systems reaching 
full capacity allows system administrators to take 
necessary measures to avoid performance 
degradation and/or failure. It was demonstrated that 
many storage systems can be modeled using a 
piecewise linear regression model. Furthermore, it 
was shown that, for the systems that could be 
modeled, the model was able to generate a forecast of 
the date of full capacity in advance. 


Many questions still remain for future analyses which 
are natural extensions of the material discussed in 
this paper: 


1. Are there other applications of predictive 
modeling within the existing data set? Could 
compression ratio, bandwidth throughput, load- 
balancing [10] or IO capacity also be predicted? 

2. Why was the piecewise linear regression model 
not able to model some systems? Could the 
model be improved or could they be modeled by 
some other method? 

3. Using the statistically significant view across the 
install base, could there be correlations between 
system variables or time series correlations for a 
single variable? 


Capacity forecasting is a fundamental utility for 
system management, but it is only a starting point of 
the data analysis that can be explored for storage 
management. 
Abstract 


When backing up a large number of computer systems 
to many different storage devices, an administrator has 
to balance the workload to ensure the successful com- 
pletion of all backups within a particular period of time. 
When these devices were magnetic tapes, this assign- 
ment was trivial: find an idle tape drive, write what fits on 
a tape, and replace tapes as needed. Backing up data onto 
deduplicating disk storage adds both complexity and op- 
portunity. Since one cannot swap out a filled disk-based 
file system the way one switches tapes, each separate 
backup appliance needs an appropriate workload that fits 
into both the available storage capacity and the through- 
put available during the backup window. Repeating a 
given client’s backups on the same appliance not only 
reduces capacity requirements but it can improve per- 
formance by eliminating duplicates from network traf- 
fic. Conversely, any reconfiguration of the mappings of 
backup clients to appliances suffers the overhead of re- 
populating the new appliance with a full copy of a client’s 
data. Reassigning clients to new servers should only be 
done when the need for load balancing exceeds the over- 
head of the move. 

In addition, deduplication offers the opportunity for 
content-aware load balancing that groups clients to- 
gether for improved deduplication that can further im- 
prove both capacity and performance; we have seen a 
system with as much as 75% of its data overlapping other 
systems, though overlap around 10% is more common. 
We describe an approach for clustering backup clients 
based on content, assigning them to backup appliances, 
and adapting future configurations based on changing re- 
quirements while minimizing client migration. We de- 
fine a cost function and compare several algorithms for 
minimizing this cost. This assignment tool resides in a 
tier between backup software such as EMC NetWorker 
and deduplicating storage systems such as EMC Data 
Domain. 


*Work done during an internship. 


Tags: backups, configuration management, infrastruc- 
ture, deduplication 


1 Introduction 


Deduplication has become a standard component of 
many disk-based backup storage environments: to keep 
down capacity requirements, repeated backups of the 
same pieces of data are replaced by references to a single 
instance. Deduplication can be applied at the granular- 
ity of whole files, fixed-sized blocks, or variable-sized 
“chunks” that are formed by examining content [12]. 

When a backup environment consists of a handful of 
systems (or “clients”) being backed up onto a single 
backup appliance (or “server”), provisioning and configuring 
the backup server is straightforward. An organization 
buys a backup appliance that is large enough to 
support the capacity requirements of the clients for the 
foreseeable future, as well as capable of supporting the 
I/O demands of the clients. That is, the backup appliance 
needs to have adequate capacity and performance for the 
systems being backed up. 

As the number of clients increases, however, opti- 
mizing the backup configuration is less straightforward. 
A single backup administration domain might manage 
thousands of systems, backing them up onto numerous 
appliances. An initial deployment of these backup appli- 
ances would require a determination of which clients to 
back up on which servers. Similar to the single-server en- 
vironment, this assignment needs to ensure that no server 
is overloaded in either capacity or performance require- 
ments. But the existence of many available servers adds 
a new dimension of complexity in a deduplicating en- 
vironment, because some clients may have more con- 
tent in common than others. Assigning similar clients 
to the same server can gain significant benefits in capac- 
ity requirements due to the improved deduplication; in a 
constrained environment, assigning clients in a content- 
aware fashion can make the difference between meeting 
one’s capacity constraints and overflowing the system. 

The same considerations apply in other environments. 
For example, the “clients” being backed up might ac- 
tually be virtual machine images. VMs that have been 
cloned from the same “golden master” are likely to have 
large pieces in common, while VMs with different his- 
tories will overlap less. As another example, the sys- 
tems being copied to the backup appliance might be 
backup appliances themselves: some enterprises have 
small backup systems in field offices, which replicate 
onto larger, more centralized, backup systems for disas- 
ter recovery. 

Sending duplicate content to a single location can not 
only decrease capacity requirements but also improve 
performance, since content that already exists on the 
server need not be transferred again. Eliminating dupli- 
cates from being transmitted is useful in LAN environ- 
ments [5] and is even more useful in WAN environments. 

Thus, in a deduplicating storage system, content-aware 
load balancing is desirable to maximize the benefits 
of deduplication. There are several considerations 
relating to how best to achieve such balance: 


Balancing capacity and throughput: Above all, the 
system needs to assign the clients in a fashion that minimizes 
hot spots for storage utilization or throughput. 
Improvements due to overlap can further reduce capacity 
requirements. 


Identifying overlap: How does the system identify how 
much different clients have in common? 


Efficiency of assignment: What are the overheads associated 
with assignment? 


Coping with overload: If a server becomes overloaded, 
what is the best way to adapt, and what are the costs of 
moving a client from that server? 


Our paper has three main contributions: 


1. We define a cost function for evaluating potential 
assignments of clients to backup servers. This function 
permits different configurations to be compared via a 
single metric. 


2. We present several techniques for performing these 
assignments, including an iterative refinement heuristic 
for optimizing the cost function in a content-aware 
fashion. 


3. We compare multiple methods for assessing content 
overlap, both for collecting content and for clustering 
that content to determine the extent of any overlap. 


Our assignment algorithm serves as a middleware 
layer that sits between the backup software and the un- 
derlying backup storage appliances. Our ultimate goal is 
a fully automated system that will dynamically reconfigure 
the backup software as needed. As an initial prototype, 
we have developed a suite of tools that assess 
overlap, perform initial assignments by issuing recommendations 
for client-server assignments, and compute 
updated assignments when requirements change. Client 
assignments can be converted into a sequence of commands 
to direct the backup software to (re)map clients to 
specific storage appliances. 

The rest of this paper is organized as follows. The next section 
provides more information about deduplication for backups 
and other related work. §3 provides use cases for 
the tool. §4 describes load balancing in more detail, including 
the “cost” function used to compare configurations 
and various algorithms for assignment of clients to 
servers. §5 discusses the question of content overlap and 
approaches to computing it. §6 presents results of simulations 
on several workloads. §7 examines some alternative 
metrics and approaches. Finally, §8 discusses our 
conclusions and open issues. 


2 Background and Related Work 


2.1 Evolving Media 


In the past decade, many backup environments have 
evolved from tape-centric to disk-centric. Backup soft- 
ware systems, such as EMC NetWorker [6], IBM Tivoli 
Storage Manager [9], or Symantec NetBackup [19], date 
to the tape-based era. With tapes, a backup server could 
identify a pool of completely equivalent tape drives on 
which to write a given backup. When data were ready 
to be written, the next available tape drive would be 
used. Capacity for backup was not a critical issue, 
since it would usually be simple to buy more magnetic 
tape. The main constraint in sizing the backup envi- 
ronment would be to ensure enough throughput across 
the backup devices to meet the “backup window,” i.e., 
the time in which all backups must complete. Some 
early work in this area includes the Amanda Network 
Backup Manager [16, 17], which parallelized worksta- 
tion backups and created schedules based on anticipated 
backup sizes. Interleaving backup streams is necessary 
to keep the tapes busy and avoid “shoe-shining” from un- 
derfull buffers, but this affects restore performance [20]. 
The equivalence of the various tape drives, however, 
made parallelization and interleaving relatively straight- 
forward. 

Disk-based backup grew out of the desire to have 
backup data online and immediately accessible, rather 
than spread across numerous tapes that had to be located, 
mounted, and sequentially accessed in case of data loss. 
Deduplication was used to reduce the capacity requirements 
of the backup system, in order to permit disk-based 
backup to compete financially with tape. The most 
common type of deduplication breaks a data stream into 
“chunks,” using features of the data to ensure that most 
small changes to the data do not affect the chunk boundaries. 
This way, inserting a few bytes early in a file 
might change the chunk where the insertion occurs, but 
the rest of the file will deduplicate. Deduplicating systems 
use a strong hash (known as a “fingerprint”) of the 
content to identify when a chunk already exists in the 
system [15, 21]. 
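
The chunking idea can be illustrated with a toy example. This is a simplified sketch, not the algorithm of any particular product: production systems typically use an efficient rolling hash and enforce minimum and maximum chunk sizes, whereas this version simply rehashes a fixed window at every byte. `WINDOW`, `MASK`, and the choice of hash functions are assumptions.

```python
import hashlib

WINDOW = 16   # bytes examined when deciding whether a chunk ends here
MASK = 0x3F   # boundary when the low 6 bits are zero (about 64-byte chunks)


def chunk(data):
    """Split data at content-defined boundaries.

    A boundary depends only on the WINDOW bytes that end at it, so an
    insertion shifts later boundaries along with the content instead of
    changing where they fall relative to that content.
    """
    chunks, start = [], 0
    for i in range(WINDOW - 1, len(data)):
        window = data[i - WINDOW + 1 : i + 1]
        digest = hashlib.blake2b(window, digest_size=4).digest()
        if int.from_bytes(digest, "big") & MASK == 0:
            chunks.append(data[start : i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def fingerprints(data):
    """Strong hash of each chunk, used to detect chunks already stored."""
    return [hashlib.sha1(c).hexdigest() for c in chunk(data)]
```

Comparing the fingerprint sets of a file before and after a small early insertion shows that only the chunks around the insertion point are new; the remainder match fingerprints that are already stored.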


2.2 Deduplication: Challenges and Oppor- 
tunities 


With deduplicated disk backups replacing tape, the 
equivalence of appliances is partly lost. Writing to the 
same storage system gains efficiencies by suppressing 
duplicate data; these efficiencies can be further reflected 
back to the backup server or even the client being backed 
up, if the duplicates are identified before data cross the 
network [5]. The effort of dividing the content into 
chunks and computing fingerprints over the chunks can 
be distributed across the backup infrastructure, allowing 
the storage appliance to scale to more clients and reduc- 
ing network traffic when the deduplication rate is high. 

Thus, the “stickiness” of the assignment of a client to a 
storage appliance changes the role of the backup admin- 
istrator. Instead of simply pooling many clients across 
many tape drives, the mapping of clients to storage ap- 
pliances needs to be done a priori. Once a client has 
been paired with a particular storage appliance, it gets 
great benefits from returning to that appliance and omit- 
ting duplicates. Should it move to a different appliance, 
it must start over, writing all of its data anew. But if its 
target appliance is overloaded, the client queues up and 
waits longer than desired, possibly causing the backup 
not to complete within its “backup window.” Capacity is 
similarly problematic, since a client that is being backed 
up onto a full storage appliance either is not protected or 
must move to another less loaded system and pay a cost 
for copying data that would otherwise have been sup- 
pressed through deduplication. 

In summary, once a client is backed up onto a particu- 
lar storage appliance, there is a tension between the bene- 
fits of continuing to use it and the disadvantages that may 
ensue from overload; at some tipping point, the client 
may move elsewhere. It then pays a short-term overhead 
(lack of deduplication) but gets long-term benefits. 

Another interesting challenge relating to deduplicating 
storage is anticipating when it will fill up. One needs to 
consider not only how much is written but also how well 
that data will deduplicate. Predictions of future capacity 
requirements on a device-by-device basis, based on 
mining past load patterns [2], would feed into our load 
balancing framework. 


2.3 Load Balancing 


Finally, the idea of mapping a set of objects to a set 
of appropriate containers is well-known in the systems 
community. Load balancing of processor-intensive ap- 
plications has been around for decades [3, 8], includ- 
ing the possibility of dynamically reassigning tasks when 
circumstances change or estimates prove to be inaccu- 
rate [13]. More recently, allocating resources within grid 
or cloud environments is the challenge. Allocating vir- 
tual resources within the constraints of physical datacen- 
ters is particularly problematic, as one must deal with 
all types of resources: processor, memory, storage, and 
network [18]. There are many examples of provisioning 
systems that perform admission control, load balancing, 
and reconfiguration as requirements change (e.g., [7]), 
but we are unaware of any work that does this in the con- 
text of deduplication. 


3 Use Cases 


In this section we describe the motivation behind this 
system in greater detail. Figure 1 demonstrates the 
basic problem of assigning backups from clients to 
deduplicated storage systems, and there are a number of 
ways in which automated content-aware assignment can 
be useful. 


Sizing and deployment: Starting with a “clean slate,” 
an administrator may have a large number of client ma- 
chines to be backed up on a number of deduplicating 
storage appliances. The assignment tool can use in- 
formation about the size of each client’s backups, the 
throughput required to perform the backups, the rate 
of deduplication within each client’s backups, the rate 
at which the backup size is expected to change over 
time, and other information. With this data it can es- 
timate which storage appliances will be sufficient for 
this set of clients. Such “sizing tools” are common- 
place in the backup industry, used by vendors to aid 
their customers in determining requirements. Using 
information about overlapping content across clients 
allows the tool to refine its recommendations, poten- 
tially lowering the total required storage due to im- 
proved deduplication. 


First assignment: Whether the set of storage appliances 
is determined via this tool or in another fashion, once 
the capacity and performance characteristics of the 
storage appliances are known, the tool can recommend 
which clients should be assigned to which storage 
system. For the first assignment, we assume that 
no clients are already backed up on any storage appliance, 
so there is no benefit (with respect to deduplication) 
to preferring one appliance over another for 
individual clients, though overlapping content is still a 
potential factor. 


Figure 1: Backups from numerous clients are handled 
by the backup manager, which assigns clients to deduplicated 
storage while accounting for content overlap between 
similar clients. In this case, there was insufficient 
room on a single storage node for all three large servers, 
so one was placed elsewhere. 


Reconfigurations: Once a system is in steady state, there 
are a number of possible changes that could result in 
reconfiguration of the mappings. Clients may be added 
or removed, and backup storage appliances may be 
added. Storage may even be removed, especially if 
backups are being consolidated onto a smaller num- 
ber of larger-capacity servers. Temporary failures may 
also require reconfiguration. Adding new clients and 
backup storage simultaneously may be the simplest 
case, in which the new clients are backed up to the 
new server(s). More commonly, extra backup capac- 
ity will be required to support the growth over time of 
the existing client population, so existing clients will 
be spread over a larger number of servers. 


Disaster recovery: As mentioned in the introduction, 
the “clients” might be backup storage appliances themselves, 
which are being replicated to provide disaster 
recovery (DR). In terms of load balancing, there is little 
distinction between backing up generic computers 
(file servers, databases, etc.) and replicating deduplicating 
backup servers. However, identifying content 
overlap is easier in the latter case because the content 
is already distilled to a set of fingerprints. Also, 
DR replication may be performed over relatively low-bandwidth 
networks, increasing the impact of any reconfiguration 
that results in a full replication to a new 
server. 


4 Load Balancing and Cost Metrics 


In order to assign clients to storage appliances, we need 
a method for assessing the relative desirability of different 
configurations, which is done with a cost function 
described in §4.1. Given this metric, there are different 
ways to perform the assignment and evaluate the results. 
In §4.2, we describe and compare different approaches, 
both simple single-pass techniques that do not explicitly 
optimize the cost function and a heuristic for iteratively 
minimizing the cost. 


4.1 Cost Function 


The primary goal of our system is to assign clients 
to backup servers without overloading any individual 
server, either with too much data being stored or too 
much data being written during a backup window. We 
define a cost metric to provide a single utility value for 
a given configuration. The cost has several components, 
representing skew, overload, and movement, as shown in 
Table 1. 

The basic cost represents the skew across storage and 
throughput utilizations of the various servers, and when 
the system is not heavily loaded it is the dominant com- 
ponent of the total cost metric. Under load, the cost goes 
up dramatically. Exceeding capacity is considered fatal, 
in that it is not a transient condition and cannot be recov- 
ered from without allocating new hardware or deleting 
data. Exceeding throughput is not as bad as exceeding 
capacity, as long as there is no “hard deadline” by which 
the backups must complete — in that event, data will not 
be backed up. Even if not exceeded, the closer capacity 
or throughput is to the maximum allowable, the higher 
the “cost” of that configuration. In contrast, having a 
significantly lower capacity utilization than is allowable 
may be good, but being 50% full is not “twice as good” 
as being 100% full. As a result, the cost is nonlinear, 
with dramatic increases close to the maximum allowed 
and jumps to extremely high costs when exceeding the 
maximum allowed. Finally, there are costs to reassign- 
ing clients to new servers. We cover each in turn. 
Table 1: Components of the cost function and related 
variables. [Table body not recoverable from the source; surviving entries: “weighted sum of skews”; m, number of servers.] 


Skew 


The basic cost starts with a weighted sum of the standard 
deviations of the capacity and throughput utilizations of 
the storage appliances: 


$$C_{basic} = \alpha D_S + (1 - \alpha) D_T,$$ 


where $\alpha$ is a configurable weight (defaulting to 0.8), $D_S$ 
is the standard deviation of storage utilizations $U_{i,s}$ (the 
storage utilization of node $i$ is between 0 and 1, or above 
1 if node $i$ is overloaded), and $D_T$ is the standard deviation 
of throughput utilizations $U_{i,t}$ (similar definition and 
range). The notion is that if predicted utilization is completely 
equal, there is no benefit to adjusting assignments 
and increasing that skew; however, one might redefine 
this metric to exclude one or more systems explicitly targeted 
to have excess capacity for future growth. Since 
throughput is more dynamic than capacity, the default 
weights emphasize capacity skew (80%) over throughput 
skew (20%). 


Overflowing Clients 


There are then some add-ons to the cost to account for 
penalties. First and foremost, if a server would be unable 
to fit all the clients assigned to it, there is a penalty for 
each client that does not fit: 


$$C_{fit} = \mathit{fit\_penalty\_factor} \times \sum_{i=1}^{m} F_i,$$ 

where $F_i$ is the number of clients not fitting on node $i$, and the 
penalty factor used in our experiments is a value of 1000 
per excess host. We use 1000 as a means of ensuring 
a step function: even if one out of many servers is just 
minimally beyond its storage capacity (i.e., a utilization 
of 1.00...01) the cost will jump to 1000+. In addition, 
when several clients do not fit, the contribution to total 
cost from the fit penalty is in the same range as the contribution 
from the utilization (see below). 

To count excess clients, we choose to remove from 
smallest to largest until capacity is not exceeded: this 
ensures that the greatest amount of storage is still allocated, 
but it does have the property that we could penalize 
many small clients rather than removing a single 
large one. We consider the alternate approach, removing 
the largest client first, in §7.2. 
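
The counting rule can be sketched as follows. The function and parameter names are illustrative rather than taken from the tool; the `largest_first` flag corresponds to the alternate approach mentioned above.

```python
def count_overflow(client_sizes, capacity, largest_first=False):
    """Number of clients that must be removed for the rest to fit.

    By default clients are removed from smallest to largest, which keeps
    the greatest amount of storage allocated but may penalize many small
    clients instead of one large one.
    """
    total = sum(client_sizes)
    removed = 0
    for size in sorted(client_sizes, reverse=largest_first):
        if total <= capacity:
            break
        total -= size
        removed += 1
    return removed


def fit_cost(servers, fit_penalty_factor=1000):
    """Fit penalty over all servers, given (client_sizes, capacity) pairs."""
    return fit_penalty_factor * sum(
        count_overflow(sizes, capacity) for sizes, capacity in servers
    )
```

For clients of sizes 5, 1, 2, and 10 on a server of capacity 12, smallest-first removal drops three clients (1, 2, and 5), while largest-first drops only the client of size 10.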


Utilization 


There are also level-based costs. There are two thresholds: 
an upper threshold (100%), above which is a clearly 
unacceptable state, and a lower threshold (80% of the maximum 
capacity or throughput) that indicates a warning 
zone. The costs are marginal, similar to the U.S. tax 
system, with a very low cost for values below the lower 
threshold, a moderate cost for values between the lower 
and upper thresholds, and a high cost for values above 
the upper threshold, the overload region. Since the costs 
are marginal, the penalty for a value just above a threshold 
is only somewhat higher than a value just below 
that threshold, but then the increase in the penalty grows 
more quickly with higher values. 

The equation for computing the utilization cost of a 
configuration is as follows. Constant scaling factors 
ranging from 10 to 10000 are used to separate the regions 
of bad configurations: all are bad, but some are worse 
than others and get a correspondingly higher penalty. 
The weight of 0.1 for the more lightly loaded utilization 
makes adjustments in the range of the other penalties 
such as utilization skew. Each range of values inherits 
from the lower ranges; for example, if $U_{i,s} > 1$ then its 
penalty is 10,000 for everything above the threshold of 
1, added to the penalty for values between 0.8 and 1 (100 
× 0.2, the size of that region) and the penalty for values between 
0 and 0.8 (0.1 × 0.8, the size of that region). Figure 2 
provides an example with several possible utilization values 
within and between the regions of interest. 


Figure 2: The cost associated with storage on a node, 
$S_i$, depends on whether the utilization falls into the low 
region, the warning region, or the overload region. The 
costs are cumulative from one region to a higher one. 


$$C_{util} = \sum_{i=1}^{m} (S_i + T_i)$$ 

$$S_i = \begin{cases} 0.1 \times U_{i,s}, & U_{i,s} < 0.8 \\ 0.1 \times 0.8 + 100 \times (U_{i,s} - 0.8), & 0.8 \le U_{i,s} \le 1 \\ 0.1 \times 0.8 + 100 \times 0.2 + 10000 \times (U_{i,s} - 1), & U_{i,s} > 1 \end{cases}$$ 

$$T_i = \begin{cases} 0, & U_{i,t} < 0.8 \\ 10 \times (U_{i,t} - 0.8), & 0.8 \le U_{i,t} \le 1 \\ 10 \times 0.2 + 1000 \times (U_{i,t} - 1), & U_{i,t} > 1 \end{cases}$$ 

The highest penalty is for being > 100% storage capacity, 
followed by being > 100% throughput. If an 
appliance is above the lower threshold for capacity or 
throughput, a lesser penalty is assessed. If it is below 
the lower threshold, no penalty is assessed for throughput, 
and a small cost is applied for capacity to reflect 
the benefit of additional free space. (Generally, a decrease 
on one appliance is accompanied by an increase 
on another and these costs balance out across configurations, 
but content overlap can cause unequal changes.) 
These penalties are weights that vary by one or more orders 
of magnitude, with the effect that any time one or 
more storage appliances is overloaded, the penalty for 
that overload dominates the less important factors. Only 
if no appliance has capacity or throughput utilization significantly 
over the lower threshold do the other penalties, 
such as skew, data movement, and small differences in 
utilization, come into play. 

Within a given cost region, variations in load still provide 
an ordering: for instance, if a server is at 110% of its 
capacity and a change in assignments brings it to 105%, 
it is still severely loaded but the cost metric is reduced. 
As a result, that change to the configuration might be accepted 
and further improved upon to bring utilization below 
100% and, hopefully, below 80%. Dropping capacity 
below 100% and avoiding the per-client penalties for 
the clients that cannot be satisfied is a big win; this could 
result in shifting a single large client to an overloaded 
server in order to fit many smaller ones. Conversely, the 
reason for the high penalty for each client that does not fit 
is to ensure that the cost encompasses not only the magnitude 
of the capacity gap but also the number of clients 
affected, but there is a strong correlation between $C_{fit}$ 
and $C_{util}$ in cases of high overload. 


Movement 


The final cost is for data movement: if a client was previously 
assigned to one system and moves to another, a 
penalty is assessed in proportion to that client’s share of 
the original system’s capacity. This penalty is weighted 
by a configurable “movement penalty factor.” Thus, a 
client with 1TB of post-dedupe storage, moving from a 
30-TB server, would add $\mathit{movement\_penalty\_factor} \times \frac{1}{30}$ 
to the configuration cost. 


$$M_i = \mathit{movement\_penalty\_factor} \times \sum_{clients_i} \frac{size_{client}}{size_i}$$ 


$$C_{movement} = \sum_{i=1}^{m} M_i$$ 


Movement_penalty_factor defaults to 1, which also results 
in the adjustment to the cost being in the same range 
as skew, though the movement_penalty_factor could be 
higher in a WAN situation. We discuss other values below. 

In total, the cost C for a given configuration is: 


$$C = C_{basic} + C_{fit} + C_{util} + C_{movement}$$ 
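
As a rough illustration of how the skew and movement terms combine with the others, the sketch below computes the basic cost and the movement cost and sums the four components. The function names are assumptions, as is the use of the population standard deviation; the text does not say whether the population or sample form is used.

```python
from statistics import pstdev


def basic_cost(storage_utils, throughput_utils, alpha=0.8):
    """Weighted sum of the skews (standard deviations) across servers."""
    return alpha * pstdev(storage_utils) + (1 - alpha) * pstdev(throughput_utils)


def movement_cost(moves, movement_penalty_factor=1.0):
    """Movement penalty for a list of (client_size, old_server_capacity)
    pairs, one pair per client that changed servers."""
    return movement_penalty_factor * sum(size / cap for size, cap in moves)


def total_cost(c_basic, c_fit, c_util, c_movement):
    """Total configuration cost as the sum of its four components."""
    return c_basic + c_fit + c_util + c_movement


# A 1 TB client leaving a 30 TB server adds 1/30 at the default factor.
example_move = movement_cost([(1.0, 30.0)])
```

With two servers at storage utilizations 0.4 and 0.6 and equal throughput, the basic cost is 0.8 × 0.1 = 0.08, the same order of magnitude as the 1/30 movement penalty above and far below any overload penalty.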


The most important consideration in evaluating a cost 
is whether it indicates overload or not; among those with 
low enough load, any configuration is probably accept- 
able. In particular, penalties for movement are inher- 
ently lower than penalties for overload conditions, and 
then among the non-overloaded configurations, any with 
movement is probably worse than any that avoids such 
movement. Thus the weight for the movement penalty, 
if at least 1 and not orders of magnitude higher, has little 
effect on the configuration selected. 
4.2 Algorithmic Approaches 


We considered four methods of assigning clients; three 
are fairly simple and efficient but tend to work badly 
in overloaded environments, while the fourth is much 
more computationally expensive but can have significant 
benefits. In all cases, if a configuration includes prede- 
termined assignments, those assignments are made first, 
and then these methods are used to assign the remaining 
clients. Existing assignments can constrain the possible 
assignments in a way that makes imbalance and overload 
likely, if not unavoidable. 

The three “simple” algorithms are as follows. None 
of them takes content overlap into account in selecting a 
server for a particular client. However, they do a limited 
form of accounting for content overlap once the server is 
assigned. The next section discusses extensive analysis 
to compute pair-wise overlaps between specific hosts, 
but computing the effects of the pair-wise overlaps 
as each client is assigned is expensive. The simple 
algorithms instead consider only class-wise overlaps, in 
which the presence of another client in the same “class” 
as the client most recently added is assumed to provide a 
fixed reduction to the newer client’s space requirements. 
That reduction is applied to that client before continuing, 
so servers with many overlapping clients can be seen
to have additional capacity. The final, more precise, 
cost calculation is performed after all assignments are 
completed. 


Random (RAND) Randomly assign clients to backup 
servers. RAND picks from all servers that have avail- 
able capacity. If a client does not fit on the selected 
server, it then checks each server sequentially. If it 
wraps around to the original selection and the client 
does not fit, the client is assigned to the first choice, 
whose utilization will now be > 1, and the cost metric 
will reflect both the high utilization and the overflow- 
ing client. By default, we run RAND 10 times and take 
the best result, which dramatically improves the out- 
come compared to a single run [14]. 


Round-robin (RR) Assign clients to servers in order, re- 
gardless of the size of the client. Again, if a server does 
not have sufficient capacity, the next server in order 
will be tried; if no server is sufficient, the first one will 
be used and an overflowing client will be recorded. 


Bin-packing (BP) Assign based on capacity, in decreas- 
ing order of required capacity, to the server with the 
most available space. If no server has sufficient ca- 
pacity, the one with the most remaining capacity (or 
the least overflow) will be selected and the overflow- 
ing client will be recorded. 
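The RR and BP rules above can be sketched as follows. This is a minimal illustration rather than the tool's implementation: the function names, the `(name, size)` client tuples, and the plain capacity list are assumptions, class-wise overlap reductions are omitted, and RR is assumed to advance its starting server by one for each client.

```python
def round_robin(clients, capacities):
    """Assign (name, size) clients to servers in order, ignoring size."""
    n = len(capacities)
    loads = [0.0] * n
    assignment, overflow = {}, []
    start = 0
    for name, size in clients:
        chosen = None
        for k in range(n):                      # try each server once, in order
            s = (start + k) % n
            if loads[s] + size <= capacities[s]:
                chosen = s
                break
        if chosen is None:                      # nothing fits: use first choice
            chosen = start
            overflow.append(name)
        loads[chosen] += size
        assignment[name] = chosen
        start = (start + 1) % n                 # assumed: advance by one server
    return assignment, loads, overflow


def bin_packing(clients, capacities):
    """Assign largest clients first to the server with the most free space."""
    loads = [0.0] * len(capacities)
    assignment, overflow = {}, []
    for name, size in sorted(clients, key=lambda c: c[1], reverse=True):
        s = max(range(len(capacities)), key=lambda i: capacities[i] - loads[i])
        if capacities[s] - loads[s] < size:     # least overflow is on this server
            overflow.append(name)
        loads[s] += size
        assignment[name] = s
    return assignment, loads, overflow
```

RAND would follow the RR fallback logic with a random first choice, repeated several times while keeping the lowest-cost run.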


Figure 3: With simulated annealing, the system tries
swapping or moving individual clients to improve the
overall system cost. Here, the different shapes are as- 
sumed to deduplicate well against each other, so swap- 
ping a circle with a triangle reduces the load of both 
machines. Then moving a circle and a triangle from 
the overloaded server on the left onto the other systems 
increases their loads but decreases the leftmost server’s 
load. The arrows represent storage utilization, with the 
red ones highlighting overload. The dark borders and 
unshaded shapes represent new or removed assignments, 
respectively. 


The fourth algorithm bears additional detail. It is the 
only one that dynamically reassigns previously assigned 
clients, trading a movement penalty for the possible ben- 
efit of lowered costs in other respects. It does a full cost 
calculation for each possible assignment, and does many 
possible assignments, so it is computationally expensive 
by comparison to the three previous approaches. 


Simulated annealing (SA) [11] Starting with the result 
from BP, perturb the assignments attempting to lower 
the cost. At each step, a small number of servers are 
selected, and clients are either swapped between two 
servers or moved from one to another (see Figure 3). 
The probability of movement is higher initially, and 
over time it becomes more likely to swap clients as 
a way of reducing the impact. The cost of the new 
configuration is computed and compared with the cost 
of the existing configuration; the system moves to the 
new configuration if it lowers the cost or, with some 
smaller probability, if the cost does not increase
dramatically. The configuration with the lowest cost is
always remembered, even if the cost is temporarily in- 
creased, and used at the end of the process. 
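A compact sketch of this loop is shown below. It is illustrative only: `toy_cost` is a stand-in that combines utilization skew with a heavy overload penalty (it omits the movement and deduplication terms of the real cost function), the Metropolis-style acceptance rule and linear cooling schedule are assumptions, and all names are hypothetical.

```python
import math
import random


def toy_cost(assign, sizes, capacities):
    """Stand-in cost: utilization skew plus a heavy penalty for overload."""
    loads = [0.0] * len(capacities)
    for client, server in assign.items():
        loads[server] += sizes[client]
    utils = [load / cap for load, cap in zip(loads, capacities)]
    overload = sum(max(0.0, u - 1.0) for u in utils)
    return (max(utils) - min(utils)) + 100.0 * overload


def anneal(initial, sizes, capacities, steps=2000, t0=1.0, seed=0):
    rng = random.Random(seed)
    current = dict(initial)
    cur_cost = toy_cost(current, sizes, capacities)
    best, best_cost = dict(current), cur_cost   # always remember the best seen
    clients = list(current)
    for step in range(steps):
        frac = step / steps
        temp = t0 * (1.0 - frac) + 1e-9         # assumed linear cooling
        cand = dict(current)
        a = rng.choice(clients)
        if rng.random() < 0.5 * (1.0 - frac):   # moves are more likely early on
            cand[a] = rng.randrange(len(capacities))
        else:                                   # otherwise swap two clients
            b = rng.choice(clients)
            cand[a], cand[b] = cand[b], cand[a]
        cost = toy_cost(cand, sizes, capacities)
        # Accept improvements; accept small regressions with some probability.
        if cost < cur_cost or rng.random() < math.exp((cur_cost - cost) / temp):
            current, cur_cost = cand, cost
            if cost < best_cost:
                best, best_cost = dict(cand), cost
    return best, best_cost
```

In practice the starting assignment would come from BP rather than from an arbitrary placement.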


We use a modified version of the Perl
MachineLearning::IntegerAnnealing library,¹
which allows some control over the way in which the
assignments are perturbed:


¹This library appears to have been superseded by the
AI::SimulatedAnnealing library,
http://search.cpan.org/~pfitch/AI-SimulatedAnnealing-1.02/.


• The algorithm accepts a set of initial assignments,
rather than starting with random assignment.


• It accepts a specification of the percent of
assignments to change in a given “trial,” when it
tries to see if a change results in a better outcome.
This percentage, which defaults to 10%, decreases
over time.


• The probability of moving a client from one storage
appliance to another or swapping it with a
client currently assigned to the other appliance is 
configurable. It starts at ‘ and declines over time. 


• The choice of the target systems for which to
modify assignments can be provided externally. 
This allows it to focus on targets that are over- 
loaded rather than moving assignments among 
equally underloaded systems. 


By default, SA is the only algorithm that reassigns a 
client that has already been mapped to a specific stor- 
age appliance (we consider a simple alternative to this 
for the other algorithms in §7.1).


We evaluate the effectiveness of these algorithms in 
§6.3. In general, RAND and RR work “well enough” if
the storage appliances are well provisioned relative to 
the client workload and the assignments are made on an 
empty system. However, if we target having each sys- 
tem around 80-90% storage utilization or adjust a sys- 
tem that was overloaded prior to adding capacity, these 
approaches may result in high skew and potential over- 
load. BP works well in many of the cases, and SA further 
improves upon BP to a limited extent in a number of cases 
and to a great extent in a few extreme examples. SA has 
the greatest benefit when the system is overloaded, es- 
pecially if the benefits of content overlap are significant, 
but in some cases it is putting lipstick on a pig: it lowers
the cost metric, but the cost is still so high that the
difference is not meaningful. Naturally, the solution in such
cases is to add capacity.


5 Computing Overlap 


There are a number of ways by which one can deter- 
mine the overlap of content on individual systems. In 
each case we start with a set of “fingerprints” represent- 
ing individual elements of deduplication, such as chunks. 
These fingerprints need not be as large as one would use 
for actual deduplication. (For instance, a 12-byte finger- 
print with a collective false positive rate of HI is fine for
estimating overlap even if it would be terrible for actu- 
ally matching chunks — for that one might use 20 bytes or 
more, with a false positive rate of 06 +) The fingerprints 
can be collected by reading and chunking the file system, 
or by looking at existing backups that have already been 
chunked. 

Given fingerprints for each system, we considered two 
basic approaches to computing overlap: sort-merge and 
Bloom filters [1]. 

With sort-merge, the fingerprints for each system are 
sorted, then the minimal fingerprint across all systems is 
determined. That fingerprint is compared to the mini- 
mal fingerprint of all the systems, and a counter is incre- 
mented for any systems that share that fingerprint, such 
that the pair-wise overlap of all pairs of systems is calcu- 
lated. After that fingerprint is removed from the ordered 
sets containing it, the process repeats. 
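The sort-merge pass can be sketched as below. This is a minimal illustration under stated assumptions: fingerprints are integers held in memory, the function name is hypothetical, and the standard-library `heapq.merge` stands in for an explicit k-way merge over sorted files.

```python
import heapq
from collections import Counter
from itertools import combinations, groupby


def pairwise_overlap(fingerprints):
    """fingerprints: dict mapping system name -> iterable of fingerprints."""
    # Sort each system's fingerprints (duplicates within a system are ignored).
    streams = [
        [(fp, name) for fp in sorted(set(fps))]
        for name, fps in fingerprints.items()
    ]
    counts = Counter()
    # The merge always yields the minimal remaining fingerprint across systems.
    for _, group in groupby(heapq.merge(*streams), key=lambda t: t[0]):
        owners = sorted(name for _, name in group)
        for pair in combinations(owners, 2):    # every pair sharing this chunk
            counts[pair] += 1
    return counts
```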

With Bloom filters, the systems are processed sequen- 
tially. Fingerprints for the first system are inserted into 
its Bloom filter. Then for each subsequent system,
fingerprints are added to a new Bloom filter, one per system.
When these fingerprints are new to that system, they are 
checked against each of the previous systems, but not 
added to them. 

The sort-merge process can be precise, if all finger- 
prints are compared. Bloom filters have an inherent error 
rate, due to false positives when different insertions have 
collectively set all the bits checked by a later data ele- 
ment. However, that false positive rate can be fairly low 
(say 0.001%), depending on the size of the Bloom filter 
and the number of functions used to hash the data. 
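The sequential Bloom filter procedure can be illustrated with the sketch below. The filter class, its SHA-256-derived hash positions, and the default sizes are assumptions chosen for brevity and are not the configuration used in the experiments; fingerprints are assumed to be `bytes`.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.m, self.k = num_bits, num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def bloom_overlap(systems, num_bits=1 << 16, num_hashes=4):
    """systems: dict name -> iterable of bytes fingerprints, processed in order."""
    filters, counts = {}, {}
    for name, fps in systems.items():
        own = BloomFilter(num_bits, num_hashes)
        for fp in fps:
            if fp in own:                       # not new to this system
                continue
            own.add(fp)
            for prev, flt in filters.items():   # check, but do not insert
                if fp in flt:
                    counts[(prev, name)] = counts.get((prev, name), 0) + 1
        filters[name] = own
    return counts
```

False positives can inflate a count (or, within a system's own filter, suppress one), which is the inherent error rate discussed above.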

If the Bloom filters are all sufficiently sparse after 
all insertions have taken place, another way to estimate 
overlap is to count the number of intersecting bits that 
have been set in the bit-vector; however, for “standard- 
size” Bloom filters setting multiple bits per element in- 
serted, we found it is easy to have a 1% overlap of fin- 
gerprints result in 20-30% overlap in bits. Each filter 
would need to be scaled to be significantly larger than 
would normally be required for a given number of ele- 
ments, which would in turn put more demands on system 
memory, or the number of bits set for each entry would 
have to be reduced, increasing the rate of false positives. 
(Consistent with this result, Jain, et al. [10], reported a 
detailed analysis of the false positive rate of intersect- 
ing Bloom filters, finding that it is very accurate when 
there is high overlap but remarkably misleading in cases 
of little or no overlap. Since we expect many systems 
to overlap by 0–20% rather than 75–100%, Bloom filter
intersection would not be helpful here.) 

Regardless of which approach is used, there is an ad- 
ditional concern with respect to clustering more than two 
clients together. Our goal is to identify what fraction of 
a new client A already exists on a system containing data 
from clients B, C, ..., Z. This is equivalent to taking the
intersection of A’s content with the union of the content
of the clients already present:


$$\mathrm{Dup}(A) = A \cap (B \cup C \cup \ldots \cup Z)$$


Figure 4: Four views of possible overlap among A, B, and C:
(a) complete overlap, (b) subset overlap, (c) distinct sets,
and (d) partial overlap. The red or magenta areas indicate
overlap that can be attributed to a single pair. The yellow
area indicates overlap that must be attributed to multiple
intersecting datasets.


However, we cannot store the contents of every client 
and recompute the union and intersection on the fly. To 
get an accurate estimate of the intersection, we ideally 
want to precompute and store enough information to es- 
timate this value for all combinations of clients. If we 
only compute the number of chunks in common between 
A and B, A and C, and B and C, then we don’t know how 
many are shared by all of A, B, and C. For example, if 
$A \cap B = 100$, $A \cap C = 100$, and $B \cap C = 100$,
$A \cap B \cap C$ may be 100 as well, or it may be 0. If A and
B are already assigned to a server and then C is added to it,
C may have as little as 100 in common with the existing
server or it may have as many as 200 overlapping. The
value of $A \cap B \cap C$ determines that quantity.
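The two bounds in this example follow from inclusion-exclusion over the chunks that C shares with the server already holding A and B. As a worked check, using only the pairwise counts of 100 given in the example:

```latex
\begin{align*}
|C \cap (A \cup B)| &= |A \cap C| + |B \cap C| - |A \cap B \cap C| \\
\text{if } |A \cap B \cap C| = 100:\quad |C \cap (A \cup B)| &= 100 + 100 - 100 = 100 \\
\text{if } |A \cap B \cap C| = 0:\quad |C \cap (A \cup B)| &= 100 + 100 - 0 = 200
\end{align*}
```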

Figure 4 depicts some simple scenarios in a three-client
example. In the first two cases, $C \subset B$, so even
though C overlaps with A the entire overlap can be
computed by looking at A and B. In the third case, B and C
are completely distinct, and so if A joined a storage
appliance with B and C the content in $A \cap B$ and
$A \cap C$ would all be duplicates and the new data would
consist of the size of A minus the sizes of $A \cap B$ and
$A \cap C$. The last case shows the more complicated
scenario in which B and C partially intersect, and each
intersects A. Here, the yellow region highlights an area
where A intersects both B and C, so subtracting $A \cap B$
and $A \cap C$ from A’s size would overestimate the
benefits of deduplication. The size of the region
$A \cap B \cap C$ must be counted only once.

Therefore, the initial counts are stored for the largest 
group of clients. By counting the number of chunks in 
common among a set S of clients, we can enumerate the
$2^{|S|}$ subsets and add the same number of matches to each
subset. Then, for each client C, we can compute the frac- 
tion of its chunks that are shared with any set of one or 
more other clients; this similarity metric then guides the 
assignment of clients to servers. 


To keep the overhead of the subset enumeration from 
being unreasonable, we cap the maximum value of $|S|$.
Fingerprints that belong to $> S_{max}$ clients are shared
widely enough not to be interesting from the perspec- 
tive of content-aware assignment, for a couple of rea- 
sons: first, if more clients share content than would be 
placed on a single storage appliance, the cluster will be 
broken up regardless of overlap; and second, the more 
clients sharing content, the greater the odds that the con- 
tent will exist on many storage appliances regardless of 
content-aware assignment. Empirically, a good value of 
Smax iS in the range E >|. 
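The capped subset enumeration can be sketched as follows. This is an illustration under assumptions: it processes one fingerprint at a time (equivalent to adding a group's count to each of its subsets), it counts only non-empty subsets, and the names `subset_counts` and `s_max` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def subset_counts(fingerprints, s_max):
    """fingerprints: dict client -> iterable of fingerprints."""
    owners = defaultdict(set)
    for client, fps in fingerprints.items():
        for fp in fps:
            owners[fp].add(client)

    counts = Counter()          # shared-chunk count per subset of clients
    widely_shared = Counter()   # per-client count of chunks on > s_max clients
    for clients in owners.values():
        if len(clients) > s_max:
            for c in clients:   # too widely shared to guide assignment
                widely_shared[c] += 1
            continue
        group = sorted(clients)
        for r in range(1, len(group) + 1):
            for subset in combinations(group, r):
                counts[subset] += 1
    return counts, widely_shared
```

Capping the group size bounds the enumeration at fewer than 2^s_max subsets per fingerprint.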

In summary, for each client, we can compute the fol- 
lowing information: 


• What fraction of its chunks are completely unique to
that client, and will not deduplicate against any other
client? This value places an upper bound on possible
deduplication.


• What fraction of its chunks are shared with at least
$S_{max} - 1$ clients? We assume these chunks will
deduplicate on any appliance that already stores other
clients, providing an approximate lower bound on
deduplication, but there is an inherent error from such an
assumption: if the $S_{max} - 1$ clients are all on a single
appliance, the $S_{max}$-th client will only get the
additional deduplication if it is co-resident with these
others.


• How much does the client deduplicate against each other
client, excluding the common chunks?


Combining the per-pair overlaps with per-triple data, we 
can identify the best-case client with which to pair a 
given client for maximum deduplication, then the best- 
case second client that provides the most additional 
deduplication beyond the first matching client. §6.2
describes the results of this analysis on a set of 21 Linux
systems. Since even the 3rd client is usually a marginal
improvement beyond the 2nd, we do not use overlap
beyond pairwise intersections in our experiments.
5.1 Approximation Techniques


Dealing with millions of fingerprints, or more, is un- 
wieldy. In practice, as long as the fingerprints are uni- 
formly distributed, it is possible to estimate overlap by 
sampling a subset of fingerprints. This sampling is sim- 
ilar to the approach taken by Dong, et al., when routing 
groups of chunks based on overlapping content [4], ex- 
cept that the number of chunks in a group was limited 
to a relatively small number (200 or so). Thus in that 
work, the quality of the match degraded when sampling 
fewer than 5 fingerprints, but when characterizing en- 
tire multi-GB or multi-TB datasets, we have many more 
fingerprints to choose from. Empirically, sampling 1 in
1024 fingerprints has proven to be about as effective as 
using all of them; we discuss this further in §6.2.1. 
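One way to sample so that every host selects the same fingerprints is to keep only those whose low-order bits are zero. The sketch below assumes integer fingerprints and this particular selection rule, neither of which is specified here; `rate_bits=10` corresponds to sampling 1 in 1024.

```python
def sample(fingerprints, rate_bits=10):
    """Keep fingerprints whose low rate_bits bits are zero (1 in 2**rate_bits)."""
    mask = (1 << rate_bits) - 1
    return {fp for fp in fingerprints if fp & mask == 0}


def estimated_overlap_fraction(a, b, rate_bits=10):
    """Estimate the fraction of a's fingerprints that also appear in b."""
    sa, sb = sample(a, rate_bits), sample(b, rate_bits)
    return len(sa & sb) / len(sa) if sa else 0.0
```

Because selection depends only on the fingerprint value, a chunk sampled on one host is sampled on every host that holds it, so the sampled intersection scales with the true intersection when fingerprints are uniformly distributed.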


In addition, it is possible to approximate the effect of
larger clusters by pruning the counts of matches whenever
the number is small enough. For instance, if $A \cap B$ is
10% of A and 5% of B, $A \cap C$ is 15% of A and 5% of C,
and $A \cap B \cap C$ is 0.5% of A, then we estimate from
$A \cap B$ and $A \cap C$ that adding A to B and C will
duplicate 25% of A’s content. This overestimates the
duplication by 0.5% of A since it counts that amount twice,
but the adjustment is small enough not to affect the
outcome. Similarly, in Figure 4d, the yellow region of
overlap $A \cap B \cap C$ is much greater than the
intersection only between A and C that does not include B:
adding A to B and C is approximately the same as adding A
to B alone, and C can be ignored if it is co-resident
with B.


This approximation does not alleviate the need to com- 
pute the overlap in the first place, since it is necessary to 
do the comparisons in order to determine when overlap is 
negligible. But the state to track each individual combi- 
nation of hosts adds up; therefore, it is helpful to compute 
the full set, then winnow it down to the significant over- 
laps before evaluating the best way to cluster the hosts. 
This filter can be applied all the way at the level of pairs 
of clients, ignoring pairs that have less than some thresh- 
old (such as 5%) of the content of at least one client in 
common. 
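The pair-level filter might look like the following sketch. It assumes one reading of the rule, namely that a pair is kept when the shared chunks reach the threshold for at least one of the two clients; the function and parameter names are hypothetical.

```python
def significant_pairs(pair_counts, totals, threshold=0.05):
    """Keep pairs whose overlap is >= threshold of either client's chunks.

    pair_counts: dict (a, b) -> number of shared chunks
    totals:      dict client -> total number of chunks on that client
    """
    kept = {}
    for (a, b), shared in pair_counts.items():
        if shared / totals[a] >= threshold or shared / totals[b] >= threshold:
            kept[(a, b)] = shared
    return kept
```

Testing each side separately preserves asymmetric matches, where a small client overlaps heavily with a much larger one.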


6 Evaluation 


In this section we describe the use of the client as- 
signment tool in real-world and simulated environments. 
§6.1 discusses the datasets used, §6.2 reports some examples
of overlapping content and the impact of sampling
the dataset fingerprints, and §6.3 compares the various
algorithms introduced in §4.2.
6.1 Datasets


To evaluate our approach, we draw from three datasets: 


1. Linux workstations, full content. We have a set of 21 
collections of fingerprints of content-defined chunks 
on individual Linux workstations and file servers. 
Most of these are drawn from a canonical internal test 
dataset² from 2005-6 containing full and incremental
backups of workstations over a period of months; since 
duplicate fingerprints are ignored, this is the union of 
all content on these systems over that period (exclud- 
ing any data that never makes it to a backup). About i 
are from a set of workstations and file servers currently 
in the Princeton EMC office, collected in 2011 through 
a single pass over each local file system. 


2. Artificial dataset, no content. In order to show the ef- 
fect of repeatedly adding clients to a system over time, 
we generated an artificial dataset with a mix of three 
client sizes. Each iteration, the system adds 20 clients: 
10 small clients with full backups of 20GB, 7 medium 
100GB clients, and 3 big 2TB clients. This adds up to 
6.9TB of total full backups, which scales to about 8TB 
of unique data to be retained over a period of several 
months. We simulate writing the datasets onto a num- 
ber of DD690 backup systems with 35.3TB storage 
each; after deduplication, about 5 sets of clients (100 
clients in total) can fit on one such appliance. We start 
with 2 servers and then periodically add capacity: the 
goal is to go from comfortable capacity to overload, 
then repeatedly add a server and add more clients until 
overloaded once again. This can be viewed as an outer 
loop, in which DD690 appliances are added, and an in- 
ner loop, in which 20 clients are assigned per iteration. 
Once assigned to a server, a client starts with a pref- 
erence for that server, except for when a new backup 
server is added: to give the non-migrating algorithms
a chance to rebalance, the previous assignments
are forgotten with 5 probability. 


We consider two types of overlap, one in which there 
is a small set of clients with high overlap, and one in 
which all clients of a “class” have small overlap. In 
the former case, each client added during an iteration 
of the outer loop deduplicates 30% of its content with 
the corresponding clients from previous iterations of 
the outer loop: the i-th client added when there were 6
DD690s dedupes well with the i-th client added when
there were [2..5] DD690s present. It deduplicates 10%
with all other clients of the same type (big, medium, 
or small). In the latter case, only the 10% per-class 
overlap applies. 


²This is the “workstations” dataset in a previous paper [4].
Host | Ideal dedup (%) | Best2 (%) | Widely Shared (%) | Match1 (%) | Match2 (%) | Chunks | Unique Chunks | Match1 host | Match2 host | Pct Saved2 (in isolation)


Phostl_| 7387 || 73.75 | 077 | 489 | 2408 || 823,256 | 215,083 | _hosi21 | hostlé | 30.9 
Phos? | 3206 || 3153] 053] 278] 3.19 || 9,065,414 | 6,158,755 | hostlé | host20 | 3.6 
hosts [1868 || 17.21 | 080] 149 | 151 || 3,843,577 | 3,125,766 | __hostd | host20 | _12. 
hosts [1508 || 1453 | 084 | 132 | 050 || 4,852,119 | 4.122.931 | hosts | host20 [| _10.2 
hosts | 1434 || 13.04] 074] 105 | 1.80 || 6,645,378 | 5,692,506 | host? | host20 | _78 
hosts | 13.00_|| 12.43 | 2.82 | 89 | 071 || 7,853,942 | 6,832,555 | host | host | 11 
host? [1167 || 10.19 | 594] 33) 0.95 || 3.460930 | 3,057,042 | hostil | hostile | 2.4 
Phosi® | 1088 || 107] 814| 25] 0.06 || 2.458516 | 2,191,010 | _hosti9 | hostis | __12 
hot? | 103 || 805 | 048 | 5.7 | 1.91 || 31,410,032 | 28,176,318 | _hostl6 | hosts | 22 
PhostI0[94r || 891] 532] 27] 0.89 || 4,195,226 | 3,800,335 | hosti3 | hostis | 22 
Phosttt_| 873 || 616] 281| 19] 146 || 8,066,949 | 7,362,355 | _hosi9 | hosti3 | 1.6 
Phosti2 [836 || 738] 467| 16] 141 || 4,512,303 | 4,135,327 | _host6 | _hosti3 | _14 
hosts [789 || 6.47 | 361] 21 | 076 || 6,231,280 | 5,739,526 | hostil | _hosti8 | __1.3 
Phostié [788 || 728 | 507 | 19] 031 || 4,361,658 | 4,018,166 | _hosti9 | hostil | 16 
hostis_[ 770 || 708 | 434 | 24 034 || 5.141.613 | 4,745,660 | _hostil | _hosti3 | __1.0 
Phostl6[ 736 || 628 | 010] 39 | 2.28 | 64,735,211 | 59,973,773 | hos | host | __27 
PhostiT_[ 7.17 | 652] 227| 33) 096 || 3,035,582 | 2.817.910 | hostié | hosts | _12 
PhostI8 [627 || 5.55] 244 | 22 091 || 9,220,185 | 8,641,937 | _hostlé | hostil | 1.6 
Phost19 [473 || 399] 239 12] 040 || 9,359,512 | 8.917.158 | _hostil | hosts | 04 
Phos20[ 3.07 | 233] 015] 18] 038 || 28,381,188 | 27,508,835 | hosts | hos | _11 
Phosat_[ 177 || 170] 003] 09 | 077 || 43,045,905 | 42,284,086 | _hostl | hostlé | _09 


PAverage| 813] | —*(| | [| 260,699,806 | 23951704 | «| 


Table 2: Inter-host deduplication of 21 workstations 





3. Customer backup metadata, no content. We have a
collection of 480 logs of customer backup metadata,
including such data as the size of each full or
incremental backup, the duration of each backup, the
retention period, and so on. These logs do not include
actual content, though they include the “class” that each
machine being backed up is in: one can infer better
overlap between machines in the same class than machines
in different classes, but not quantify the extent of the
overlap. (We assume a 10% overlap for clients in the
same class.) We preprocess these logs to estimate the
requirements for each client within a given customer
environment, compute the size and number of backup
storage appliances necessary to support these clients,
then assign the clients to this set of storage appliances.
By adjusting the desired threshold of excess capacity,
we can vary the contention for storage capacity and
I/O. In this paper we consider only the largest of these
customer logs, with nearly 3,000 clients.
Figure 5: This is a visual depiction of the data in Table 2,
showing the component contributions to the
deduplication of each host.


6.2 Content Overlap 


Table 2 and Figure 5 describe the intersection of the 21
Linux datasets. Hosts are anonymized via the names
“host1” to “host21,” shown in the first column of the
table. The next column shows the idealized deduplication,
computed by dividing the number of unique chunks
by the number of chunks and subtracting the result from
100%. (We assume that all chunks are the same size,
although in practice they are statistically an average of
8 Kbytes.) Hosts are sorted by the best possible
deduplication rate. The average across all hosts is about 8%.

The next column, Best2, reports the deduplication
obtained by matching a given host against the best two
hosts, as described in §5. It is the sum of the widely
shared data appearing on that host (typically around
1–2% but as high as 8%), the additional deduplication
specifically against the Match1 host, and the additional
deduplication against the Match2 host. Since the second
match excludes both common chunks and anything on
Match1, the added benefit from the second host is usually
under 2%, but in the case of host1, the second
matching host provides about half as much deduplication
as the first host, over 24%. Pct Saved2 indicates
how much deduplication could have been achieved by
the second host without the first.

The columns listing which hosts provided the best and 
second-best deduplication indicate that a handful of hosts 
provide most of the matches. Also, the relationships are 
not always symmetric, in part because of varying dataset 
sizes. Host2 is the best match for Host16 and vice- 
versa, but in other cases it is more of a directed graph. 

Figure 5 shows this data visually. The height of each 
bar corresponds to the best possible deduplication. The
blue bar at the bottom is the percent of chunks on that 
host that appear on many other hosts, the red bar shows 
the additional benefit from the best single match, the 
green bar shows the additional benefit from a second 
host, and the purple bar shows extra deduplication that 
might be obtained through three or more co-resident 
hosts. Not all bars are visible for each host. For the first 
three hosts, arrows identify the matching hosts shown in 
the table. A host with relatively little data may dedupli- 
cate well against a larger host, while the larger host gets 
relatively little benefit from deduplicating in turn against 
the smaller one; in this case the host with the best overall 
deduplication matches the host with the poorest dedupli- 
cation, as a fraction of its total data. 

Lest there be a concern that there is a small num- 
ber of examples reflecting good deduplication, while 
the average is relatively low, there are other meaning- 
ful datasets with substantial overlap. For example, two 
VMDK files representing different Windows VMware 
virtual machine images used by an EMC release engi- 
neering group overlapped by 49% of the chunks in each. 


6.2.1 Sampling 


Our goal for sampling is to ensure that even with approx- 
imation, the system will find the same “best match” as 
with perfect data (a.k.a. the “ground truth”), or at least a
close approximation to it. We use the following criteria: 


• If a host H had a significant match with at least one
other host $H_1$ of 5% of its data, above and beyond
the “widely shared” fingerprints, we want the
approximated best match to be close to the ground
truth. We define “close” as a window B around
the correct value, which is within either 5%, 10%,
or 20% of the value, with a minimum of 1%. For
example, if the ground truth is 50%, acceptable
B = 5% approximations would be 47.5–52.5%, but
if the ground truth is 5%, values from 4–6% would
be acceptable. Note that if the estimated match were
outside that range but $H_1$ was believed to be the best
match, we might cluster the two together but
misestimate the benefit of the overlap.


• If the best match found via approximation is with
another host $H_2$, rather than the ground truth best
match, it may still be acceptable. The approximate
overlap needs to be close to the actual overlap of $H_2$,
or we would misestimate the benefits, but we only
would find the alternate host $H_2$ acceptable if it was
within B of the value of $H_1$. Thus the approximate
match $H_{2,approx}$ must be $> (1 - B)H_1$ and
$< (1 + B)H_1$.


• If the host had no significant match (> 5%) with
another single host, we want the approximation to
reflect that. But again, a small change is acceptable.
For example, if the best match were 4.5% and
we would have ignored it, but the approximation
reports that the best match is 5.5%, that is a reasonable
variance. If the best match was 1% and is now
reported as 5.5%, that would be a significant error.
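The window used in the first criterion can be written as a short check. This is a sketch under assumptions: overlaps are expressed in percent, the window is the larger of the relative margin and the 1% minimum, boundary values are treated as acceptable, and the names are hypothetical.

```python
def within_window(truth_pct, approx_pct, margin=0.05, min_abs=1.0):
    """True if approx_pct is within the acceptance window around truth_pct.

    margin:  relative margin of error (0.05, 0.10, or 0.20 in the text)
    min_abs: minimum half-width of the window, in percentage points
    """
    window = max(margin * truth_pct, min_abs)
    return abs(approx_pct - truth_pct) <= window
```

With a 5% margin, a ground truth of 50% accepts roughly 47.5 to 52.5, while a ground truth of 5% falls back to the 1-point minimum and accepts 4 to 6.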


The ranges of overlap are important because in prac- 
tice a high relative error is inconsequential if the extent 
of the match is limited to begin with. If we believe two 
clients match in only 0.5% of their data, we are unlikely 
to do much differently if we estimate this match is 1% 
or 2%, or if we believe there is no match at all. On the 
other hand, if we think that a 50% match is only 25% 
or is closer to 100%, the assignment tool might make a 
bad choice. Even if it picks the right location due to the 
overlap, it will underestimate or overestimate the impact 
on available capacity. 

Figure 6 depicts the effect of sampling fingerprints, us- 
ing the same 21-client fingerprint collection. The x-axis 
depicts the sampling rate, with the left-most point corre- 
sponding to the ground truth of analyzing all fingerprints. 
As the graph moves to the right, the sampling rate is re- 
duced. There are three curves, corresponding to margins 
of error B = 5%, B = 10%, and B = 20%. The y-axis 
shows the fraction of clients with an error outside the 
specified B range. For a moderate margin of error there
is little or no error until the sampling rate is lower than
1/1024, though if one desires a tighter bound on B, the error
rate increases quickly.
Figure 6: For a given sampling rate, on the x-axis,
we compute what fraction of clients have their biggest
overlap approximated within a given error threshold B.


6.3 Algorithm Comparison 


In general, any of the algorithms described in §4.2 work 
well if the system is not significantly loaded. As capac- 
ity or throughput limits are reached, however, the sys- 
tem can accommodate the greatest workloads through 
intelligent resource allocation. This is especially true if 
there is significant overlap among specific small subsets 
of clients. 

In our analysis here, we focus on capacity limitations 
rather than throughput. This is because backup stor- 
age appliances are generally scaled to match throughput 
and capacity, so it is rare to experience throughput bot- 
tlenecks without also experiencing capacity shortages. 
Since it can occur with high-turnover data (a good deal 
of data being written but then quickly deleted), the cost 
function does try to optimize for throughput as well as 
capacity. 


Incremental Assignment 


We first compare the four algorithms as clients and
backup storage are repeatedly added, using the artificial
dataset described in §6.1 and new servers every 120
clients.

Figure 7 shows the results of this process with the 
number of clients increasing across the horizontal axis 
and cost shown on the left vertical axis. (Part (a) shows 
the full range of cost values on a log scale, while (b) 
zooms in on the values below 150, on a standard scale, 
to enable one to discern the smaller differences.) The two 
Capacity curves in 7(a) reflect the ratio of the estimated 
Capacity requirements to the available backup storage, 
with or without considering the effects of the best-case 
deduplication, and are plotted against the right axis. A
value over 1 even with deduplication would indicate a
condition in which insufficient capacity is available, but
any values close to or above 1 indicate potential
difficulties.


Figure 7: An artificial, homogeneous client population
is added 20 hosts at a time, with new backup storage
added every 120 hosts after the first 240. A small set
of clients match each other with 30% deduplication and
otherwise hosts of the same type match 10% of their data.
The costs are shown by the curves marked on the left
axis. The capacity requirements are shown by the curves
at the bottom of the top graph, marked on the right axis.


In Figure 7, the general pattern is for the “simple” 
algorithms to fail to fit all the clients within available 
constraints, once the collective requirements first exceed 
available capacity, while SA cycles between being able to 
accommodate the clients and failing to do so (but still be- 
ing an order of magnitude lower cost even when failing 
to fit them). There is a stretch between 600–700 clients
in which it does particularly well; this is because in this 
iteration of the outer loop, the number of distinct clus- 
ters of highly overlapping clients equals the number of 
storage appliances, and the system balances evenly. 

While the sequence depicted in Figure 7 is a case in 
which explicit pair-wise overlap is essential to fitting 
the clients in available capacity, the sequence in Figure 8 
adds fewer clients per storage appliance. Clients 
almost always fit, though SA improves upon the other 
approaches some of the time. As expected, RR is not quite 
as good as BP; when the number of clients is high, there 
are cases where RR exceeds capacity because it considers 
only whether a client fits and not how well it fits, and 
because it is constrained by earlier assignments. RAND 
similarly fails when 1000 clients are present. 

Figure 8: The same artificial, homogeneous client population 
is added 20 hosts at a time, with new backup storage 
added every 100 hosts after the first 200. The costs 
are shown by the higher curves, marked on the left axis. 
The capacity requirements are shown by the curves at the 
bottom of the graph, marked on the right axis. 
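
The contrast between RR and BP noted above (whether a client fits versus how well it fits) can be illustrated with a small sketch. This is not the tool's code: best-fit decreasing is a textbook heuristic standing in for BP here, deduplication is ignored, and the client sizes and server capacities are invented for illustration.

```python
from typing import Dict, List, Optional


def round_robin(sizes: List[int], capacities: List[int]) -> Dict[int, Optional[int]]:
    """Place each client on the next server in rotation that still has room."""
    free = list(capacities)
    assignment: Dict[int, Optional[int]] = {}
    start = 0
    for client, size in enumerate(sizes):
        assignment[client] = None  # stays None if no server can take it
        for step in range(len(free)):
            server = (start + step) % len(free)
            if size <= free[server]:
                free[server] -= size
                assignment[client] = server
                start = (server + 1) % len(free)
                break
    return assignment


def best_fit_decreasing(sizes: List[int], capacities: List[int]) -> Dict[int, Optional[int]]:
    """Largest clients first, each onto the fullest server that can still hold it."""
    free = list(capacities)
    assignment: Dict[int, Optional[int]] = {c: None for c in range(len(sizes))}
    for client in sorted(range(len(sizes)), key=lambda c: -sizes[c]):
        candidates = [s for s in range(len(free)) if sizes[client] <= free[s]]
        if candidates:
            server = min(candidates, key=lambda s: free[s])
            free[server] -= sizes[client]
            assignment[client] = server
    return assignment


# Hypothetical workload: five clients arriving in this order, two equal servers.
sizes = [5, 2, 5, 2, 4]
capacities = [9, 9]
rr = round_robin(sizes, capacities)
bp = best_fit_decreasing(sizes, capacities)
print("RR unplaced:", [c for c, s in rr.items() if s is None])
print("BP unplaced:", [c for c, s in bp.items() if s is None])
```

In this toy case round robin strands the last client because its earlier placements fragmented the free space, while the size-aware heuristic places all five.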

In summary, we find that under high load RAND, RR, 
and even BP fail to have acceptable costs in a large num- 
ber of cases, but SA shuffles the assignments to better 
take advantage of deduplication and fits within available 
capacity when possible. While the SA results overlap 
the BP results in some cases, whenever there is a pur- 
ple square without a matching aqua + overlaid upon it in 
Figure 7, SA has improved. 


Full-Content Client Dataset 


Here we describe the effect of assigning the 21-client 
dataset to a range of backup appliances. The overlaps of 
the datasets are derived from the full set of fingerprints 
of each client, but in the case of Host1, which is the host 
that is relatively small but has high overlap, we artifi- 
cially increase its backup sizes by two orders of magni- 
tude to represent a significant host rather than a trivially 
small one. Including this change, the clients collectively 
require 2.92TB before deduplication and 2.46TB or more 
after deduplication. They are assigned to 2-4 storage 
appliances with either 0.86TB (“smaller”) or 1.27TB 
(“larger”) capacity each.* For the smaller servers, the 
clients take from about 70-140% (post-dedupe) of the 
available storage as the number of backup systems is 
reduced, corresponding to 85-170% pre-deduplication. 
For the larger ones, they take 46-92% (deduplicated) or 
58-115% (undeduplicated). That is, even with deduplication, 
at the highest utilization the clients cannot fit on 
only two of the smaller servers, but they fit acceptably 
well on the larger ones. 

* These numbers are taken from early-generation Data Domain 
appliances and are selected to scale the backup capacity to the offered 
load. In practice, backup appliances are 1-2 orders of magnitude larger 
and growing. 

Figure 9: Cost as a function of relative capacity, pre-dedupe, 
for the modified 21-host dataset, for two backup 
appliance sizes. Algorithms are either content-aware 
(CA) or not content-aware (NC). 


Figure 9 shows the cost as a function of pre- 
deduplication utilization. For RAND and SA, it presents 
two variants: one, the content-aware version, is the de- 
fault; the other selects the lowest cost assuming there is 
no overlap, then recomputes the cost of the selected con- 
figuration with overlap considered. For BP and RR, over- 
lap is considered only to the extent that two clients are in 
the same class, and the adjustment is made after a given 
client is assigned to a server (refer to §4.2). 

Using smaller servers (9(a)), RR has a slightly higher 
cost under the lowest load; both RR and RAND (NC) are 
overloaded under moderate load, and all algorithms are 
overloaded under the highest load with just two servers. 
While it is not visible in the figure, SA without factoring 
content overlap into its decisions is about 6% higher cost 
than the normal SA which uses that information. 

Using larger servers (9(b)), the costs across all algorithms 
are comparable in almost all cases. The notable 
exception is SA at the highest load: it is overloaded if it 
ignores content overlap, but fine otherwise. Interestingly, 
RAND does just as well with or without content overlap, 
as its best random selection without taking overlap into 
account proves to be a good selection once overlap is 
considered. 

In other words, at least for this workload, there are 
times when it is sufficient to load balance blindly, ig- 
noring overlap, and have overlap happen by happy co- 
incidence. But there are times when using overlap infor- 
mation is essential to finding a good assignment: an ap- 
proach that considers overlap can better take advantage 
of shared data. 


Large Customer Dataset 


We ran the assignment tool on the clients extracted from 
the largest customer backup trace, as described in §6.1. It 
has nearly 3,000 clients requiring about 325TB of post- 
deduplicated storage. Using 3 Data Domain DD880s to- 
taling 427TB, these use about 76% of capacity, and all 
four algorithms assign the clients with a low cost: the 
maximum is 0.80 for round robin, while BP and SA are 
0.24 and 0.23 respectively. It is worth noting that most of 
the ten RAND runs had costs over 2, but one was around 
0.4 and the best was identical to the BP result. SA took 
over four hours and only improved it from 0.24 to 0.23. 

What about overload conditions? If these were just 2 
DD880s (285TB), the average storage utilization goes to 
114% so no approach can accommodate all the clients. 
Even so, the cost metric is a whopping 184K for BP, 
183K for RR, and 159K for RAND (which, by taking the 
lower costs of client overlap into account when comparing 
alternatives, is able to find a slightly better assignment). 
These high costs are dominated by the “fit 
penalty” due to about 130—160 clients, out of 2983, not 
fitting on a server. SA, however, brought the cost down to 
25K (of which 12K is from 12 clients not fitting). However, 
it did this by running for 5.5 CPU-days (see the next 
subsection). 
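
The utilization figures quoted for the two configurations follow directly from the stated totals. A minimal check, using only the numbers given above (the function name is illustrative):

```python
def utilization(required_tb: float, capacity_tb: float) -> float:
    """Fraction of total capacity consumed by the post-dedupe requirement."""
    return required_tb / capacity_tb


three_servers = utilization(325, 427)  # three DD880s totaling 427TB
two_servers = utilization(325, 285)    # two DD880s totaling 285TB
print(f"{three_servers:.0%} {two_servers:.0%}")  # prints "76% 114%"
```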

Obviously one would not actually try to place 3,000 
clients, totaling 325TB of post-dedupe storage, on a pair 
of 142TB servers. This example is intended to show how 
the different approaches fare under overload, and it also 
provides an example of a large-scale test of SA. The large 
number of clients to choose from poses a challenge, in 
that a cursory attempt to move or swap assignments may 
miss great opportunities, but an extensive search adds 
significant computation time (see the next subsection). 
Tuning this algorithm to adapt to varying workloads and 
scales and deciding the best point to prune the search are 
future work. 

Figure 10: Cost as a function of simulated annealing 
analysis time for several cases. Both axes use a log scale. 
Except for the right-most points, any points that appear 
within a factor of 1.5 in both the x and y values of a point 
already plotted are suppressed for clarity. 


6.4 Resource Usage 


While our results have shown that SA can produce better 
assignments than the other algorithms in certain cases, 
there is a cost in terms of resource requirements. All 
three “simple” algorithms are compact and efficient. 
For example, the unoptimized Perl script running bin- 
packing on the nearly 3,000 clients and two small servers 
in the preceding subsection took 163MB of memory and 
ran in 23s on a desktop Linux workstation. Running SA 
on the same configuration took over 5 days, and the com- 
plexity of the problem is only increased when pair-wise 
rather than per-class overlaps are included. For the itera- 
tive problem with up to about 1,000 clients and pair-wise 
overlaps, the script takes several gigabytes of memory 
and runs for over a half day on a compute server. 

Figure 10 shows timing results for five examples of 
earlier experiments. Two are the large-scale assignments 
described in the previous subsection, with nearly 3,000 
clients that either fit handily or severely overload the 
servers. The horizontal line at the bottom represents the 
case where SA runs for over four hours with no effective 
improvement over a cost that is already extremely low. 
The curve toward the top with open squares is the same 
assignment for 2/3 of the server capacity. SA dramatically 
reduces the cost, but it is still severely overloaded. The 
curve (with open triangles) near that one represents one 
of the incremental assignment cases in which the system 
is overloaded regardless of SA, while the one just below 
that has 20 more clients but one additional server and, 
in the case of SA, has a relatively low cost after a long 
period of annealing (the sharp drop around the 10-hour 
mark is an indication of SA finally succeeding in rearranging 
the assignments to fit capacity). Finally, the remaining 
triangle curve represents a smaller test case in 
which the cost starts low but SA improves it beyond what 
BP initially did. 

In some cases (not plotted), there is a drop followed 
by a long tail without improvement. Ideally the process 
would end after a large score decrease if and only if no 
substantial decreases are still possible; since there is the 
potential to miss out on other large improvements, we 
let SA continue to search and hope for further decreases. 
Generally, with our default parameters, SA runs for sec- 
onds to hours on a desktop computer, but when config- 
uring or updating a backup environment, that is not un- 
reasonable, and the “best solution to date” can be used at 
any time. The more excess capacity there is, the easier 
it is for SA to home in on a good solution quickly. For 
assignments of thousands of clients in an overloaded en- 
vironment, some sort of “divide and conquer” approach 
will be necessary to keep the problem manageable. 
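
The “best solution to date” property can be made concrete with a minimal simulated-annealing sketch. Everything here is an assumption for illustration: the toy cost function (squared utilization plus a heavy overflow penalty), the linear cooling schedule, the move/swap perturbations, and the parameter values are not the cost function or tuning evaluated in this paper.

```python
import math
import random
from typing import List, Tuple


def toy_cost(assign: List[int], sizes: List[int], capacities: List[int]) -> float:
    """Hypothetical cost: mild penalty for utilization, heavy penalty for overflow."""
    used = [0.0] * len(capacities)
    for client, server in enumerate(assign):
        used[server] += sizes[client]
    cost = 0.0
    for u, cap in zip(used, capacities):
        util = u / cap
        cost += util ** 2
        if util > 1.0:
            cost += 1000.0 * (util - 1.0)
    return cost


def anneal(assign: List[int], sizes: List[int], capacities: List[int],
           steps: int = 2000, t0: float = 1.0, seed: int = 0) -> Tuple[List[int], float]:
    """Perturb assignments by moves and swaps, always remembering the best seen."""
    rng = random.Random(seed)
    current = list(assign)
    cur_cost = toy_cost(current, sizes, capacities)
    best, best_cost = list(current), cur_cost
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9  # linear cooling, never zero
        cand = list(current)
        if rng.random() < 0.5:   # move one client to a random server
            cand[rng.randrange(len(cand))] = rng.randrange(len(capacities))
        else:                    # swap the servers of two clients
            a, b = rng.randrange(len(cand)), rng.randrange(len(cand))
            cand[a], cand[b] = cand[b], cand[a]
        cand_cost = toy_cost(cand, sizes, capacities)
        delta = cand_cost - cur_cost
        # Always accept improvements; accept regressions with a temperature-dependent probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, cur_cost = cand, cand_cost
            if cur_cost < best_cost:
                best, best_cost = list(current), cur_cost
    return best, best_cost


sizes = [5, 2, 5, 2, 4]
capacities = [9, 9]
start = [0] * len(sizes)          # deliberately bad: everything on server 0
start_cost = toy_cost(start, sizes, capacities)
best, best_cost = anneal(start, sizes, capacities)
print(start_cost, best_cost, best)
```

Because `best` is updated only on improvement, the loop can be interrupted at any step and still return an assignment no worse than the starting point.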


7 Variations 


In this section we discuss a couple of variations on the 
policies previously described: “forgetting” assignments 
and biasing in favor of small clients in the cost function. 


7.1 Forgetting Assignments 


As described to this point, whenever new clients are 
added to an existing set of assignments, the first assign- 
ments are “carved in stone” for the simple algorithms: 
they cannot be modified, and only the new unassigned 
clients can be mapped to any server. The SA algorithm 
is an exception to this, in that it can perturb existing as- 
signments in exchange for a small movement penalty. 
Here we consider a simple but extreme change to this 
policy: ignore all existing assignments, map the clients 
to servers using one of the algorithms, and pay move- 
ment penalties depending on which clients change as- 
signments. When the assignment that takes previous as- 
signments into account does not cause overflow, start- 
ing with a clean slate usually results in a higher cost be- 
cause the movement penalties are higher than the other 
low costs from the “good” and “warning” operating re- 
gions. But when there would be overflow, it is often the 
case that rebalancing from the start avoids the overflow. 

Figure 11 repeats Figure 7(a), with one change: for 
RR, RAND, and BP, each point is the minimum of 
the original datapoint and a new run in which the previous 
assignments were ignored during assignment.* Ignoring 
initial assignments improved the cost metric in 
35% of the cases overall, and in 43% of the cases in 
which the cost was over 1000 (indicating significant 
overload): it is frequently useful but no panacea. 

* Due to the high cost of SA, we do not re-run each SA experiment 
but instead take the minimum of the SA run and the “forgotten” BP run; 
that is, SA could have started from the lower BP point rather than the 
previous one that considered previous assignments. It might improve 
the cost beyond that point, something not reflected in this graph. 

Figure 11: The same clients and servers are assigned as 
in Figure 7(a), but previous assignments can be ignored 
in exchange for a movement penalty. 
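
The “minimum of the two runs” rule can be sketched as follows. The flat per-client movement penalty and all numeric values are assumptions for illustration; the actual penalty structure is the one defined by the cost function earlier in the paper.

```python
from typing import Dict


def cost_with_forgetting(incremental_cost: float, fresh_cost: float,
                         old_assign: Dict[str, int], new_assign: Dict[str, int],
                         move_penalty: float) -> float:
    """Lower of: keep prior assignments, or start over and pay for every moved client."""
    moved = sum(1 for client, server in old_assign.items()
                if new_assign.get(client) != server)
    forgotten_cost = fresh_cost + moved * move_penalty
    return min(incremental_cost, forgotten_cost)


old = {"a": 0, "b": 0, "c": 1}
new = {"a": 0, "b": 1, "c": 1}   # only client "b" changes servers

# Overloaded case: starting over wins despite the movement penalty.
print(cost_with_forgetting(1500.0, 0.5, old, new, move_penalty=0.25))   # 0.75
# Comfortable case: the movement penalty outweighs the small gain.
print(cost_with_forgetting(0.375, 0.25, old, new, move_penalty=0.25))   # 0.375
```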

The most notable difference between Figure 7(a) and 
Figure 11 is in the range of 600-700 clients. Previously 
we noted that SA does especially well in that range be- 
cause of overlap, but if BP and RR start there with com- 
pletely new assignments, they too have a low cost due to 
keeping better deduplicating clients together. 


7.2 Counting Overflow 


As described, the cost function biases in favor of large 
clients: it assumes that it is more important to back up 
a larger client than a smaller one, so it removes clients 
in order of size, smallest first, to count the number of 
clients that do not fit on a server. This approach is in- 
tuitive, in that a large client probably is more important 
than a small one, and it also simplifies accounting be- 
cause if clients are added in decreasing order of size, we 
can remove a small client without affecting the dedupli- 
cation of a larger one that remains. 

An alternative cost function would minimize the num- 
ber of occurrences of overflow by removing the largest 
client(s) to see if what remains will fit. This has the effect 
of minimizing the extra per-client penalty while still pe- 
nalizing for exceeding the capacity threshold. In essence, 
it encourages filling N-1 servers to just below 100% uti- 
lization, then placing all the remaining (large) clients on 
the Nth server. 
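
The two counting orders can be compared with a short sketch. It ignores deduplication entirely (sizes are simply summed), and the client sizes and capacity are made-up values chosen only to show how the counts diverge.

```python
from typing import List


def count_overflow(sizes: List[int], capacity: int, largest_first: bool = False) -> int:
    """Number of clients that must be removed before the remainder fits."""
    ordered = sorted(sizes, reverse=largest_first)
    total = sum(ordered)
    removed = 0
    for size in ordered:
        if total <= capacity:
            break
        total -= size
        removed += 1
    return removed


sizes = [50, 30, 10, 5, 5]   # hypothetical client sizes on one server
capacity = 80

print(count_overflow(sizes, capacity))                      # smallest first: 3
print(count_overflow(sizes, capacity, largest_first=True))  # largest first: 1
```

Removing the largest client first clears the overflow with a single removal, which is why that ordering yields a smaller per-client penalty for the same overloaded server.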

Figure 12 compares the smallest-first and biggest-first 
penalties for the example used in Figure 7, modified to 
exclude the pair-wise 30% deduplication of specific combinations 
of clients. (This is because recomputing the 
impact of removing a client against which other clients 
have deduplicated would require a full re-evaluation of 
the cost function, compared with class-wise deduplication, 
and has not been implemented.) The two graphs 
look quite similar, but because of the change to the value 
of C_fit the peak values are about one order of magnitude 
lower when the largest clients are counted. There is no 
qualitative difference in this example beyond a narrowed 
gap between SA and the other approaches. 

Figure 12: The same clients and servers are assigned as 
in Figure 7, but deduplication is only considered within a 
class (big, medium, or small) rather than having greater 
deduplication for specific pairs. C_fit is computed by removing 
the (a) smallest or (b) largest clients first. 


8 Discussion and Future Work 


Assigning backups from clients to deduplicated stor- 
age differs from historical approaches involving tape be- 
cause of the stickiness of repeated content on the same 
server and the ability to leverage content overlap between 
clients to further improve deduplication. We have ac- 
counted for this overlap in a cost function that attempts to 
balance capacity and throughput requirements and have 
presented and compared several techniques for assigning 
clients to backup storage appliances. 

When a backup system has plenty of resources for the 
clients, any assignment technique can work well, and 
there is little difference between RAND and our most advanced 
technique with SA. The more interesting case 
is when capacity requirements reach beyond 80% of 
what is allocated. We have found that RAND and RR 
tend to degrade rapidly, while bin-packing and SA con- 
tinue to maintain a low cost until capacity becomes over- 
subscribed. In cases of significant overlap, SA is able to 
use client overlap to increase the effective capacity of a 
set of deduplicating backup servers, deferring the point 
at which the system is overloaded. 


There are a number of open issues we would like to 
address: 


• evaluation of overlap in a wider range of backup 
workloads 

• evaluation of overlap beyond the “best match” for 
those cases where cumulative deduplication beyond 
one other host is significant 

• full integration between client assignment and 
backup software 

• use of the assignment tool to manage transient 
bursts in load due to server failures or changes in 
workload 

• additional evaluation of the various weights and 
cost function 

• optimization of the SA algorithm for large-scale 
environments 

• additional differentiation of clients and servers, for 
instance to route backups to different types of devices 
automatically depending on their update patterns 
and deduplication rates. 


Efforts to integrate content affinity with pre-sales sizing 
are already underway. 


Getting to Elastic: Adapting a Legacy Vertical Application 
Environment for Scalability 

Eric Shamow - Puppet Labs 


ABSTRACT 
During my time in the field prior to joining Puppet Labs, I experienced several 
scenarios where I was asked to be prepared for so-called “elastic” operations, which 
would dynamically scale according to end-user demand. This demand only intensified as 
the notion of moving to IaaS became realistic. There's no button you hit marked "make 
elastic" to turn your infrastructure into an elastic cloud...rather you need to come to an 
understanding both of the technologies your organization uses, its tolerances for latency 
and downtime, as well as your platform, to get there. This paper discusses the key areas 
that must be addressed: organizational culture, technical policy development, and 
infrastructure readiness. 

Introduction 


As I’ve moved through the industry, it’s 
become increasingly common to find 
organizations operating what might be termed 
an “internal cloud” - a commodity hardware 
infrastructure front-ended by VMware, Xen, or 
another virtualization technology, being used to 
cushion the need for rapid and varying server 
deployments. Over the past few years, I have 
seen increasing interest in outsourcing that 
operation - in moving to external cloud 
offerings including IaaS. In most cases, I've also 
needed to become prepared for elastic 
expansion of our apps as we modify them to 
scale out rather than up. 

I encountered many of these problems 
during the time I spent as Manager of the 
Systems Operations group at Advance Internet. 
Advance is a mid-size company in the 
publishing field, running approximately 1050 
servers in a local, private cloud. Although I left 
Advance prior to the full implementation of our 
elastic solution, I was deeply involved in the 
architecture and implementation of that 
solution, and was fortunate to learn valuable 
lessons about how to turn an entrenched static 
environment into a dynamic one. 

There's no button you hit marked "make 
elastic" to turn your infrastructure into an elastic 
cloud...rather you need to come to an 
understanding both of the technologies your 
organization uses, its tolerances for latency and 
downtime, as well as your platform, to get there. 
Advance traveled some of this road, and this 
report will include both information about the 
solutions we found, and some recommendations 
for those attempting to do the same. 


Characterizing the Problem 

In order to consider what will be necessary to 
“go elastic,” we must first evaluate what that 
phrasing really means. How elastic do we want 
to be? What parts of our applications are able to 
scale easily? What parts do not? What elements 
of our process or infrastructure make automatic 
expansion impossible? In short, what do we 
need to know? 

At Advance, in examining our 
environment, I identified five major questions 
or issues that would be show-stoppers for us 
implementing any kind of scalable environment: 

1) Our servers and applications could not be 
deployed without human intervention. 
Documentation was limited and there was no 
automation available. 

2) We had no information available about when 
to deploy a new server automatically. There 
was a mandate to be able to expand 
dynamically, but no information about what 
that meant. 

3) Similarly, we did not know when to 
automatically retire a server. How 
responsive to increases and decreases in load 
would we need to be? 

4) What was to be the mechanism for the 
automatic deployment and retirement? 

5) Were our applications optimized to take 
advantage of this type of scaling? In several 
cases our experience was that performance 
improvement was not a linear correlation 
with an increase in server count - and in fact 
that in some cases increasing parallelism was 
damaging to performance. We would need to 
determine which applications would need to 
be refactored to handle this architecture, and 
which were prepared to handle it natively. 


In any environment facing similar issues, 
the five listed above will form the core of the 
matter - the remainder of our internal fact- 
finding extended naturally from the answers we 
found and the process we underwent in 
attempting to determine those answers. 

For those undergoing the same 
exploration, this fact-finding exercise will form 
the groundwork for all future work in this space. 
This means that truthful responses and openness 
are absolutely necessary. The teams involved 
don’t need to agree on a solution yet, but 
without a common understanding of the 
problem space, we cannot reasonably determine 
whose concerns or enthusiasm are justifiable. It 
can often help to present this as an opportunity 
to air long-unaddressed concerns in a new way. 
If the application team distrusts elasticity, 
encourage them to fully explain and justify 
those concerns and promise that they will be 
addressed as part of the proposed solution. 
Getting everyone to cooperate here is the most 
critical step of the process. For me, getting to 
elastic meant a lot less engineering than I 
expected, and a whole lot more PR, meetings, 
and assuaging of concerns. 


Elasticity Means Automation 


The first key recognition about elastic 
expansion is that, by definition, it means that the 
server provisioning process must be automated. 
This is a bridge that many organizations have 
yet to cross. In some cases the deployment 
process itself may be automated, but post-install 
configuration is not completed automatically. 
My own findings gathered from the 
organizations I have observed - and this was the 
case at Advance as much as at others - are that 
most installation and configuration procedures 
are not automated because groups do not have 
clear and stable procedures that are followed for 
deployment. Whether this is because 
deployment teams do not maintain regular 
standards for system configuration, or because 
development teams do not provide accurate 
release notes or cleanly packaged applications 
ultimately comes down to finger-pointing; the 
organization as a whole must recognize that if it 
wants elasticity, it will need automation, and 
automation requires clarity of purpose and 
requirements, and stability of procedures. 

Automation itself has multiple 
components, and depending on the breakdown 
of roles and responsibilities within an 
organization, these components are often 
managed by different groups. Infrastructure 
groups will have concerns about provisioning 
storage and network; OS groups will worry 
about package repositories, OS versioning, and 
configuration management; application groups 
will focus on updating application-specific 
configurations to recognize new or removed 
members of a cluster, reshuffling data that has 
been partitioned based on previous cluster size, 
and changing various application settings to 
properly tune performance. All of these are 
critical and should be clearly mapped. 

Where possible, inquiry into how they 
affect each other is worth discussion - does 
repartitioning our data suggest different OS 
configs? With the new cluster size, should we 
alter our load balancer configuration? However, 
don’t let these advanced discussions derail the 
primary goal of understanding how your 
systems are provisioned. The second-level 
analysis of how those systems interact will 
occur naturally during the design and 
implementation of your process, and should 
continue to iterate through its lifecycle. The 
most important thing is to come to an 
understanding of those manual processes which 
are not currently automated. Those manual steps 
are your hard roadblocks on the way to 
elasticity. 

Ultimately, at Advance, we settled on a 
toolset of Kickstart for OS deployments, 
managed through Cobbler for the additional 
repository and profile information it permitted. 
We then handed off to Puppet for application 
installation and configuration, having worked 
closely with the application teams to build 
Puppet manifests that handled their applications 
appropriately. On the infrastructure side, the 
SAN, network and VMware team decided to 
manually script their deployment, resulting in a 
tool called vDeploy. I will discuss this tool later 
on in the paper. Ultimately, the tools you choose 
should be based on two factors: your own 
comfort with them, and their flexibility to 
work well together and to integrate with each 
other. It is not always critical to choose the 
best-of-breed software, but rather to choose the 
software that best fits you and your 
organization. 


Elasticity Requires Open Metrics 


An additional component to expanding 
and contracting an environment in an automated 
fashion is that accurate and relevant metrics 
about that environment must be available. In 
order for those metrics to be meaningful for 
elasticity, they must be reliable and 
comprehensive enough that an unattended 
system can make bottom-line decisions based 
on them: should I deploy or remove a live 
system from my customer-facing site 
immediately? This means that the metrics 
cannot be siloed as many IT reporting 
infrastructures are, but must reflect both the 
state of the application infrastructure and 
the applications running on it. These metrics 
must also be reliable: they must not be 
inaccurate, fudged, or intermittently available 
because of an individual group’s desire to hide 
information from the rest of the team. Elastic 
expansions and contractions affect the whole 
without human intervention, but by definition 
this process is naive - it can only know what we 
tell it. If we lie to the system, the system will 
make poor choices. 
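
To make the point concrete, here is a minimal sketch of an unattended scaling decision that refuses to act on missing or stale metrics. The metric names, thresholds, and the `Sample` type are all hypothetical; they are not drawn from the tooling used at Advance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    value: Optional[float]  # None when the collector returned nothing
    age_seconds: float      # how long ago the sample was taken


def scaling_decision(latency_ms: Sample, cpu_percent: Sample,
                     max_age: float = 120.0) -> str:
    """Return 'deploy', 'retire', or 'hold' from one application and one system metric."""
    # An unattended system only knows what it is told: with missing or stale
    # data, the safe choice is to do nothing rather than guess.
    for sample in (latency_ms, cpu_percent):
        if sample.value is None or sample.age_seconds > max_age:
            return "hold"
    # Hypothetical thresholds; real values come out of cross-team discussion.
    if latency_ms.value > 800.0 or cpu_percent.value > 85.0:
        return "deploy"
    if latency_ms.value < 200.0 and cpu_percent.value < 30.0:
        return "retire"
    return "hold"


print(scaling_decision(Sample(950.0, 10.0), Sample(40.0, 10.0)))   # deploy
print(scaling_decision(Sample(950.0, 10.0), Sample(None, 10.0)))   # hold
```

Note that the decision combines an application-level metric with a system-level one, which is only possible when neither is locked in a silo.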

The choice of metrics should also reflect a 
cross-disciplinary approach. Much is lost in IT 
monitoring because of a lack of communication 
between groups. A monitoring team will pride 
itself on implementing trend lines for disk 
utilization, but will fail to monitor a change in a 
transaction rate or size easily exposed by the 
monitored application itself. These metrics can 
predict an increase in the rate of growth at a 
time when the change would only appear to be a 
statistical anomaly in the storage data. Again, 
an understanding of these interrelationships will 
evolve from the discussions and 
implementations you undertake here, 
and we shouldn’t spend too long attempting to 
nail them down early. That said, any 
understanding we can get about 
interrelationships between the components in 
our environment helps us better predict future 
changes. Better prediction means better 
automation, which means elasticity that’s less 
likely to break. 

At Advance this was a major source of 
contention. Monitoring was highly siloed, with 
Systems controlling an array of Cacti, PNP, 
MRTG, and proprietary VMware, 3PAR, and 
NetApp applications to monitor and graph data - 
in fact, even within Systems, monitoring was 
siloed, split between different implementations 
in the DBA, infrastructure, and operations 
spaces. Application development staff often 
maintained off-the-radar monitoring systems 
stashed on workstations or quasi-production 
servers. The metrics from these groups were 
never aggregated, and much time was lost 
bouncing requests and information back and forth 
between multiple people who were hesitant to 
allow access to - or knowledge of the existence 
of - their proprietary systems. 


Openness Requires Culture Change 


If the organization preparing to implement 
a model based on elastic expansion is not in the 
state needed to gather the information above - 
with a clear availability of infrastructure, OS, 
and application-level metrics across the board, 
honest communication between groups, and 
well-documented deployment and configuration 
changes - elastic expansion is unlikely to be 
possible. These steps are all prerequisites for 
technological change, but they themselves are 
less technological than cultural. If organizations 
are going to be prepared for elasticity - 
operating at a minimum cost most of the time 
but prepared for the huge onrush of traffic 
caused by an article “going viral” or the sudden 
success of their service [1] - they must address the 
underlying lack of transparency before they can 
begin to work on the technical challenges. 

In reality, getting this to happen is often 
the hardest part of the process. It is fortunate if 
the change is being implemented in a top-down 
manner, in that if management is mandating the 
change, it is often willing to enforce that 
mandate by requiring teams to cooperate. But 
what if the change isn’t mandated? 

In my own experience, the best approach 
is two-pronged. The first prong is to establish 
the missing communication. As the head of an 
Operations team, I regularly met with the heads 
of Development teams, including those of small 
development groups that my predecessors had 
often ignored. I wanted to know their pain 
points, where Operations was letting them down 
or frustrating their work. Establishing this 
communication was key to establishing trust. 

Trust, however, does not come through 
words but through deeds. The best action I 
found I could take in this regard was to 
surrender unilaterally. I might not be able to get 
developers or infrastructure to share everything 
with me, but I would share everything with 
them. Every incident was clearly documented, 
metrics were available to all teams, and we 
developed a process for requesting the addition 
of new metrics. I committed to making these 
newly-requested metrics available to them with 
a response time based on severity, ranging 
from 20-30 minutes during a crisis, to a 
maximum of 48 hours outside of one. 

I also worked hard to develop a 
professional, chain-of-command-based 
communication system with development 
managers. This may not be applicable in all 
engineering environments - in many, having all 
discussions on a public list is part of the fabric 
of their work culture. But it can also result in 
decisions made based on ego and pride rather 
than technical judgment. Being called out on an 
error or disagreement in public forces a 
different type of response from a concern 
brought quietly in private. At Advance I 
committed to bring development concerns to the 
relevant managers and help triage my team’s 
issues rather than exposing them on our internal 
IRC channels and mailing lists, and asked the 
development managers to do the same. The 
ratcheting-down of public tensions combined 
with the daily give-and-take of triaging 
priorities with the other managers aided greatly 
in establishing an understanding of other teams’ 
needs and willingness to cooperate. 


Getting Things Started 


We’ve now established communication 
between departments, established some baseline 
metrics that we need to pay attention to, and 
defined clearly the expectation that server 
rollouts and retirements - from the bare metal 
phase to appearing in a user-facing cluster - 
should be automated. Now we’re ready to do 
some work. But where to begin work? 

For the purposes of this paper, I will 
assume that metric collection systems are 
already available to you, and that you need only 
tune your existing system to provide you the 
agreed-upon information. There are a variety of 
tools excellent at collecting and displaying raw 
data - from the simplicity of MRTG to more 
complex tools such as Munin or Cacti, and 
newer distributed tools such as Graphite or 
Ganglia. The use of one or more of these will 
depend on your data sources and the familiarity 
of your teams with the tools in question. My 
team used a mix of Cacti and PNP4Nagios, 


although we were strongly looking into 
Graphite as a replacement. 


Finding Meaningful Metrics 


Assuming that we have monitoring 
technology in place, the next obvious question 
is “what do we measure?” The answer to this 
question may at first seem obvious to 
stakeholders on all sides of the discussion, but a 
quick synchronization of expectations often 
indicates that each group’s answer is different. 
The infrastructure and OS groups will tend to 
monitor metrics focused on the performance of 
the system itself such as processor load, 
memory availability, I/O throughput, CPU 
percentage (distinct from load, which really 
measures queue length - a distinction lost on 
many involved in resource monitoring)², and 
swap usage. 

In the meantime, the application team will 
likely be focusing on internal data points that 
reflect the actual capacity of the application 
itself, identifying performance of key areas of 
code, headroom left in caching applications 
such as Memcache or Varnish, and other data 
points that reflect how pieces of the code are 
relating to each other. If there is a separate 
business owner with access to a dashboard or 
metrics, that person or group is likely 
examining more vanilla performance stats - for 
a web application, time for first byte download, 
hits per second, and so forth. 

It is very likely that none of these metrics 
will give you on its own the answer that 
indicates at what point your application will 
need to elastically expand. In fact, it is likely 
that, until this point, any discussions about non- 
elastic expansion have involved meetings 
between several stakeholders to review this data 
and find ways to optimize on existing hardware. 
Finding the right formula is an exercise in 
looking at aggregate data patterns, finding 
correlations that seem to reliably suggest the 
need for additional servers, and then regularly 
re-examining those metrics as the application 
and hardware profiles change. 

The worst mistake you can make at this 
point is assuming that you know or understand 
too much about your application or 
environment. What was true several months ago 
may not be true now...a feature in the 
application that caused an I/O bottleneck six 
weeks ago may have been rectified in the 
application code four weeks ago, and now 
you’ve hit a CPU limit on your storage device. 
Assumptions about causes tend to produce bad 
interpretations of data, which in turn lead to a 
misunderstanding of what criteria will need to 
be used for automated scaling. So the 
important part of this stage is to 
have a fresh discussion about application 
performance and possible bottlenecks at all 
levels - an informed discussion, but one that 
makes no assumptions and thoroughly re- 
examines every facet of the environment 
looking for hidden indicators and bottlenecks. 
You won’t find them all, but application, 
operations and business people together will 
find a lot more than any of those three alone. 
This is what DevOps looks like in practice. 


The Shifting Landscape 


Before moving on to the next stage of 
enabling the elastic environment, I want to 
return to a phrase used just a few paragraphs 
back: “What was true several months ago may 
not be true now.” 

While this will always be the case in fast-moving, 
multifaceted IT environments, what 
should not be the case is that any of this 


changing truth should be undocumented, or 
worse, a complete surprise to all but one or two 
people. In a field with rampant hyper- 
specialization with limited training budgets and 
one or two “experts” in a given technology per 
group, it is almost inevitable that sub-pockets of 
activity have developed which are at least 
partially invisible, even to members of that 
pocket’s own team. 

This type of change is absolutely toxic to 
elastic expansion. Since all of the painstaking 
research and rule development you are doing is 
based around a shared understanding of the 
environment, changes to that environment that 
are not automated make it impossible to deploy 
a single additional node without manual 
intervention. For that reason, change 
management must be implemented for an elastic 
environment to succeed. 

This may sound like a leap, but if you 
examine the nature of elasticity, the reasoning 
becomes clear. Elasticity is essentially a set of 
rules wrapped around automation -- a set of 
conditions under which automated procedures 
should take place. Automation itself is really 
nothing more than a form of machine-parseable 
and actionable documentation - we are taking 
yesterday’s run book or wiki doc and turning it 
into a YAML file, but in the end we are writing 
documentation about how a system should be 
configured, and then using an application to 
verify compliance with that document. 
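
To make the "documentation that a tool can verify" idea concrete, here is a minimal sketch in Python. It is not taken from any particular configuration management tool: the `DESIRED` structure, the package versions, and the `find_drift` helper are all hypothetical, and the desired state is shown as a dictionary of the kind a YAML file would load into.

```python
# Hypothetical desired state, as it might look once loaded from a YAML file.
DESIRED = {
    "packages": {"httpd": "2.2.15", "php": "5.3.3"},
    "services": {"httpd": "running"},
}


def find_drift(desired, actual):
    """Compare the documented state with what a node reports.

    Returns a sorted list of (section, name, wanted, found) tuples, one per
    item whose actual value differs from the documented one. A missing item
    is reported with found=None.
    """
    drift = []
    for section, wanted_items in desired.items():
        found_items = actual.get(section, {})
        for name, wanted in wanted_items.items():
            found = found_items.get(name)
            if found != wanted:
                drift.append((section, name, wanted, found))
    return sorted(drift)


if __name__ == "__main__":
    reported = {"packages": {"httpd": "2.2.15", "php": "5.3.2"}, "services": {}}
    for item in find_drift(DESIRED, reported):
        print(item)
```

An empty result means the node matches its documentation; anything else is a change that was made outside the automated path.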

Note that I did not say that change control 
was needed - merely change management³. As 
long as the changes made are compatible with 
the rest of the operating environment and do not 
interfere with its operation, those changes can 
be submitted without review. Whether it is wise 
to do so is a different matter, but don’t attempt 
to bite off more than you can chew here - the 
framework for change management can be 


3 http://www.technologyexecutivesclub.com/Articles/nanagement/artChangeControl.php 
expanded to include change control later on. For 
now, the important thing is that any change that 
would affect the ability to automatically rebuild 
a system is made part of the server and app 
deployment processes. 


Policy 


Even when using clearly defined metrics 
to signal the need for expansion, there are 
additional factors to consider. First, we must 
consider the statistical anomaly. If you are 
running a website, you don’t want to scale to a 
thousand machines because a web crawler hit 
your site and began to index, or because a user 
wrote a bad script to fetch your page every 
millisecond. Similarly, we must consider how 
long it takes for a new server to come up. 
Depending on the nature of your environment, 
this can be very tricky. If load increases sharply, 
you may need a new server in under a minute. 
Even with well-automated deployment, a large 
database server can take five or more minutes to 
power up and build. If this is not fast enough to 
save your application from falling over, we have 
missed the point of the elastic expansion. 

The reverse is also true. If load drops to 
nothing because of an ISP failure, we do not 
want our production cloud to shrink to its 
minimum size. We also don’t want to power 
down servers we think we may need again in a 
few seconds or minutes. 

It is the rules around making these 
determinations that I refer to as “policy” - not a 
formal organizational policy, but rather an 
internal technical policy explaining when and 
how fast you expand, when and how fast you 
contract, what the artificial limits to both of the 
above operations should be, and how we work 
around the elements of those operations that 
don’t fit our environment. 
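
One common way to keep a statistical anomaly from triggering expansion is to require that a metric stay above its threshold for several consecutive samples. The following Python sketch illustrates that idea only; the class name, the threshold, and the sample count are assumptions for illustration, not values used at Advance.

```python
from collections import deque


class SustainedThreshold:
    """Signal only when the last N samples all exceed a threshold."""

    def __init__(self, threshold, samples_required):
        self.threshold = threshold
        # The deque keeps only the most recent `samples_required` values.
        self.window = deque(maxlen=samples_required)

    def observe(self, value):
        """Record one sample; return True if expansion should be considered."""
        self.window.append(value)
        window_full = len(self.window) == self.window.maxlen
        return window_full and all(v > self.threshold for v in self.window)


if __name__ == "__main__":
    check = SustainedThreshold(threshold=0.8, samples_required=3)
    for sample in [0.9, 0.95, 0.5, 0.9, 0.9, 0.9]:
        print(sample, check.observe(sample))
```

A short spike (a crawler, a runaway script) fills only part of the window and is ignored; the same structure with the comparison reversed could guard contraction.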


4 http://pulpproject.org/ 


There is no formula that can be generated 
for this outside of an examination of your own 
application’s behavior and the metrics you 
should now be gathering. As an example, 
however, I can discuss the type of solutions we 
had envisioned at Advance. 

For the particular example of database 
servers, we looked at a combination of server 
load, database server queue length, and slow 
query information from the database server, and 
latency and queue information from the 
application side, to determine that a new 
database server was needed. However a new 
database server could take in excess of ten 
minutes to provision, far too long to resolve a 
sudden explosion of activity. 

Advance’s solution was to mandate that, 
depending on cluster size, one to two database 
servers would be provisioned and immediately 
powered down as part of our cluster at 
minimum size. Every time new database servers 
were automatically provisioned, 1-2 extra 
servers would be provisioned and immediately 
powered off. When the need came for new 
servers, we could begin provisioning additional 
servers but simultaneously power up the 1-2 
idle servers, providing relief to the application 
within a minute, while additional resources 
came on line. We employed this strategy in 
reverse while shutting systems down, 
decommissioning them but always leaving 1-2 
systems powered down but not destroyed. 
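
The standby arrangement described above can be sketched as a small planning function. This is an illustrative Python sketch, not Advance's implementation: `plan_scale_up` and its parameters are hypothetical names, and `spare_target` stands in for the "1-2 extra servers" rule.

```python
def plan_scale_up(needed, standby, spare_target):
    """Decide how to satisfy a request for `needed` more servers.

    standby      -- servers already provisioned but powered off
    spare_target -- powered-off spares to keep after the operation

    Returns (power_on, provision_active, provision_spare):
      power_on         -- standby servers to power up right away
      provision_active -- new servers to build and leave running
      provision_spare  -- new servers to build and power off again
    """
    power_on = min(needed, standby)
    provision_active = needed - power_on
    remaining_standby = standby - power_on
    provision_spare = max(0, spare_target - remaining_standby)
    return power_on, provision_active, provision_spare


if __name__ == "__main__":
    # Three servers needed, two idle spares available, keep two spares.
    print(plan_scale_up(needed=3, standby=2, spare_target=2))
```

The powered-off servers give relief quickly, while the slower provisioning work both covers the remaining demand and refills the spare pool.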

We also decided to implement several 
caps on growth and decommissioning to hedge 
against the possibility of failures in our metrics 
and formulas. We only allowed growth to 
proceed at a limited rate, controlling the 
maximum number of servers that could be 
provisioned per 15-minute period, and setting a 
maximum limit on the number of machines that 
could be auto-deployed without administrator 
intervention. We set similar limits on 
decommissioning. 
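
A growth cap of this kind can be sketched in a few lines of Python. The class below is a hypothetical illustration: the names and the numbers in the usage example are assumptions, with only the 15-minute window taken from the text.

```python
class GrowthCap:
    """Limit automatic provisioning per time window and in total."""

    def __init__(self, per_window, window_seconds, total_limit):
        self.per_window = per_window          # max servers per window
        self.window_seconds = window_seconds  # e.g. 900 for 15 minutes
        self.total_limit = total_limit        # max auto-deployed, ever
        self.events = []                      # timestamps of recent grants
        self.total = 0                        # servers granted so far

    def allow(self, now):
        """Return True and record the grant if one more server is allowed."""
        cutoff = now - self.window_seconds
        self.events = [t for t in self.events if t > cutoff]
        if self.total >= self.total_limit:
            return False  # needs administrator intervention
        if len(self.events) >= self.per_window:
            return False  # growing too fast; wait for the window to pass
        self.events.append(now)
        self.total += 1
        return True


if __name__ == "__main__":
    cap = GrowthCap(per_window=2, window_seconds=900, total_limit=3)
    for t in (0, 10, 20, 1000, 2000):
        print(t, cap.allow(t))
```

Resetting `total` would be the administrator's action once the growth has been reviewed; a mirror-image object could cap decommissioning.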

This strategy works well for a “naive” 
application, where application servers are not 
aware of each other and can scale out 
horizontally. This is not the case for most 
applications, particularly in-house ones which 
have been written to scale vertically - requiring 
more resources such as RAM and CPU - rather 
than horizontally. As a result, many of these 
apps will not see a linear improvement as each 
server is added, and it is possible to see a 
diminishing return, and eventually even a 
negative impact from the addition of more 
servers. While an application rewrite down the 
line should help this, it’s almost never 
immediately possible; rather, you should tailor 
your expansion policies to fit the characteristics 
of the application you have, while encouraging 
your development teams to begin thinking in 
terms of horizontal rather than vertical resource 
usage in the future. 

There is an additional concern - 
application servers which must remain aware of 
each other - which we will return to after a 
discussion of the necessary remaining 
components of the elastic toolset. 


Getting the Infrastructure Ready 


For the purposes of this discussion, I will 
assume that the reader is functioning in a 
“cloud”-type virtualized environment. It is 
possible to scale elastically in a hardware 
environment, but the complexity level is much 
higher. While implementing this system, I was 
working with an internal cloud built on 
VMware vSphere, with Infoblox providing 
DNS and DHCP and Cobbler for provisioning 
and repository management. 

The key infrastructure elements needed to 
support this are as follows: 
• Network support - your network devices 
must support servers being brought up in a 
variety of subnets. In a virtualized 
environment, this typically means that the 
appropriate networks are available to the 
virtual switches used for provisioning. 
Depending on the size of your environment 
and complexity of your network layout, you 
may need to do additional work on the virtual 
switch side and VM controller configurations 
to ensure that new servers are brought up on 
servers with access to the appropriate subnets. 
At Advance, where nearly all subnets were 
available to all VMs for provisioning, this was 
vastly simplified; in most organizations 
however this is not the case. 

• Network service support - either pre- 
provisioned static IP addresses for new 
servers with appropriate ports provisioned, or 
DHCP. Since most bare-metal configuration 
requires DHCP and PXE booting capability, 
having both will make your life much easier. 
If a subnet fills up, your auto-deployment 
tools should be robust enough to capture and 
handle that error, even if only by paging an 
admin to resolve the problem. One of the 
reasons the Infoblox was terrific for this 
deployment was the ease of access to its 
DHCP interface for both querying of available 
addresses and provisioning of reserved 
addresses. 

• DNS readiness for automated deployment. 
This means that your DNS zones should be 
laid out clearly, with reasonable reverse- 
mapping of IP addresses, so that automated 
provisioning is straightforward. The system 
needs to know what IP address to assign based 
on system role. 

• Appropriate connectivity to build 
environments. You must have the bandwidth 
to push down OS images and patch data to 
multiple servers quickly. 
• API or command-line access to your 
virtualization platform which will enable you 
to create new VMs, grab their MAC 
addresses, and hand information about them 
to your bare-metal deployment system. 
VMware is shaky in this regard, but it 
provided enough access for us to comfortably 
do what we needed. 

• Automated OS licensing. If you need to enter 
a username and password at the console and 
that information can’t be stored in an answer 
file, elastic expansion is a no-go. 

• Automated patch management. This is often 
overlooked, but it’s very important that a 
server brought up today look like one that was 
brought up last week. If we install an OS, 
even from the same image, but then run an 
update against current package repositories, 
our server today may have a very different set 
of packages from the server deployed last 
week. So it is important that all servers talk to 
the same repository set, with the same 
package version information across the board. 
We were struggling with this when I departed 
Advance, but had identified the Pulp project⁴ 
as a possible solution. 
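
Two of the requirements in the list above, choosing an address by system role and handling a full subnet gracefully, can be sketched with Python's standard `ipaddress` module. This is an illustration only: the role names, the subnets, and the `SubnetFull` exception are hypothetical, and a real deployment would query its DHCP/DNS appliance rather than a local set of addresses.

```python
import ipaddress


class SubnetFull(Exception):
    """Raised when a role's subnet has no free addresses left."""


# Hypothetical mapping from system role to the subnet it lives in.
ROLE_SUBNETS = {"web": "10.1.0.0/29", "db": "10.2.0.0/29"}


def next_free_address(role, in_use):
    """Return the first unused host address in the subnet for `role`.

    in_use -- a set of address strings that are already assigned
    """
    network = ipaddress.ip_network(ROLE_SUBNETS[role])
    for host in network.hosts():
        if str(host) not in in_use:
            return str(host)
    # Surface the problem explicitly so the caller can page an admin.
    raise SubnetFull(f"no free addresses left in {network} for role {role!r}")


if __name__ == "__main__":
    print(next_free_address("web", {"10.1.0.1", "10.1.0.2"}))
```

The explicit exception is the point: the deployment tool can catch it and escalate instead of failing halfway through a build.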


OS and Application Deployment 


Your OS deployment choices will be 
largely shaped by your OS choice. As a CentOS 
environment, we used Cobbler for system 
deployments. There are a multitude of 
alternatives - Foreman, Spacewalk, or even 
hosting kickstart files on a regular webserver. 
The important thing is that the deployment 
system be able to identify a host and hand it the 
appropriate base configuration. Your OS install 
should be generic and minimal; don’t try to 
handle 50 gold master images, but rather let 
your configuration management tool handle the 
heavy lifting. 


At Advance, we chose Puppet as a 
configuration management system, and as I 
have since left Advance to work for Puppet 
Labs, my preferences are clear. However, using 
any tool in this space puts your organization 
light years ahead of most of its competition. The 
key is not which configuration management tool 
you use, but the discipline to stick with that tool 
and keep everything in configuration 
management. Remember that, as discussed 
earlier, if it’s not in configuration management, 
it can’t be deployed automatically. 

At this point I will return briefly to the 
concept of clusters that are not a collection of 
naive servers, but which must be aware of their 
own configuration or of each other. 
Configuration management provides the 
solution for this. Servers can be assigned 
environments or variables based on their 
intended role or position in a cluster, and 
configuration files can be templatized based on 
that information. In Puppet, we can use 
Exported Resources to ship dynamic 
information out of nodes to a shared datastore, 
so that other nodes can learn about them and 
make decisions. With proper scripting and 
policies, we can repartition our data sets in what 
is now a self-aware, elastically growing cluster. 


Ad Hoc Administration 


There are circumstances in any IT 
environment that don’t fit well into the 
paradigm of change/configuration management. 
Suppose we want to kick all the Apache servers 
in a particular datacenter, or remount NFS 
volumes attached to a storage device that went 
belly-up? 

The old solutions were SSH in a for loop, 
and ClusterSSH, which displays multiple 
terminals and allows a user to control them all 
simultaneously. Newer tools in this space 
provide more accountability and control and 
better reporting. 

At Advance we had been using the Marionette 
Collective, or MCollective, for a few months 
when Puppet Labs acquired it, cementing our 
choice. Whether you use MCollective, func, 
fabric, Knife, or any other tool, the important 
thing is that ad hoc administration should be 
compatible with your change management 
environment. If changes in one disrupt the 
other, automation will break. Many of these ad 
hoc tools force you into writing clients or 
carefully-wrapped agent scripts, something seen 
as an inconvenience. But there’s a reason for 
this: we want to be able to execute something in 
a controlled period of time and then aggregate 
and return the results in a meaningful way. We 
can then store and report on the results and even 
audit the activities of the people using the tools. 

The more centralized and automated this 
solution, the less likely it is to have unexpected 
impact on the managed environment. If we take 
the SSH in a for loop example - if we run that 
loop against 1500 servers, who is going to parse 
the results to notice that server 650’s response 
didn’t quite look right? And if it didn’t, will the 
next round of changes cause server 650 to 
diverge even further from the remaining 1499? 
Tools with built-in auditing and data 
summarization can find these issues before they 
become problems or unexplained application 
behavior. 
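
The kind of summarization described here can be illustrated with a short Python sketch. It does not reproduce the behavior of MCollective or any other tool named above; `summarize` and `outliers` are hypothetical helpers that group hosts by identical command output so the odd one out is visible at a glance.

```python
from collections import defaultdict


def summarize(results):
    """Group hosts by identical output, largest group first.

    results -- dict mapping host name to the command output it returned
    """
    groups = defaultdict(list)
    for host, output in results.items():
        groups[output.strip()].append(host)
    # Sort by descending group size, then by output text for stability.
    return sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))


def outliers(results):
    """Return hosts whose output differs from the most common output."""
    grouped = summarize(results)
    return sorted(host for _, hosts in grouped[1:] for host in hosts)


if __name__ == "__main__":
    results = {"web%04d" % i: "httpd OK" for i in range(1, 1501)}
    results["web0650"] = "httpd OK (1 warning)"
    print(outliers(results))
```

With 1500 responses collapsed into two groups, the single divergent server is reported by name rather than buried in scrollback.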


Where To Next? 


I was saddened to leave Advance before 
we actually went elastic in production, but we 
had all the groundwork in place, thanks to our 
infrastructure team’s construction 
of their vDeploy tool, which interfaced with our 
VMware, DNS and DHCP environments to 
deploy new servers, then handed off to my 
Operations team’s Cobbler and Puppet 
environments. 

The workflow was that our Nagios-based 
monitoring system would trigger vDeploy only 
if the appropriate business criteria were met, 
causing vDeploy to build a new host based on 
information passed from Nagios. The concept of 
doing this sounded unthinkable at the start of 
the design process, but after analyzing the 
problem, it became clear that technologically, 
there were very few hurdles. Most applications 
and environments have APIs or RESTful 
interfaces that can be used for this sort of 
communication, and writing these scripts was 
simply a matter of putting in the work. 

The actual complexity lay in building the 
application and business rules around when 
these things should happen. Focusing on 
communication and shared information rather 
than the engineering details proved to be the 
key. Good engineering and technology selection 
is key but is made much easier by taking the 
time to understand the business logic that these 
engineering exercises are designed to satisfy. 
While the impulse of many engineers is to jump 
in and start coding, taking the time to 
understand and manage the underlying cultural 
and infrastructure issues can turn development 
of an elastic environment from a seemingly 
insurmountable series of roadblocks to an 
exercise in small-scale script development. 
Scaling on EC2 in a fast-paced environment 


Practice and Experience Report 
Nicolas Brousse, Lead Operations Engineer, TubeMogul, Inc. 


Email: nicolas@TubeMogul.com 


Abstract — Managing a server infrastructure in a fast- 
paced environment like a start-up is challenging. You have little 
time for provisioning, testing and planning but still you need to 
prepare for scaling when your product reaches the tipping 
point. Amazon EC2 is one of the cloud providers that we 
experimented with while growing our infrastructure from 20 
servers to 500 servers. In this paper we will go over the pros 
and cons of managing EC2 instances with a mix of Bind, LDAP, 
SimpleDB and Python scripts; how we kept a smooth working 
process by using NFS, auto-mount and shell-scripting; why we 
switched from managing our instances based on tailor-made 
AMI/Shell-scripting to the official Ubuntu AMI, Cloud-init and 
puppet; and finally, we will go over some rules we had to follow 
carefully to be able to handle billions of daily non-static HTTP 
requests across multiple Amazon EC2 regions. 


Index Terms - Amazon EC2, scalability, fault-tolerance, 
infrastructure, DevOps. 


I. WHAT IS AMAZON EC2 AND HOW DOES IT WORK? 


Amazon AWS¹ provides a wide range of web services. 
Amazon EC2² is part of AWS as a public cloud solution. 
EC2 lets you start servers, called instances, on demand. You 
are billed per hour of usage and can stop an instance at any 
time. You can start your instance in a given geographic 
Region and Availability Zone⁴. 

Because of the large adoption of EC2, Amazon added a 
layer of indirection so that each AWS account’s Availability 
Zones can map to different physical data center equivalents⁵. 

[Figure: Amazon EC2 Regions us-east-1 (US East Coast) and 
eu-west-1 (Europe), each containing Availability Zones 
(us-east-1a, us-east-1b, eu-west-1a, eu-west-1b, other AZs), 
alongside other Regions] 

Fig 1. Amazon EC2 : Region and Availability Zone 

When starting an instance, you will generally have to 
provide at least four pieces of information: the AMI⁶ (server 
image), the instance type⁷ (RAM/CPU/arch), the Security 
Group⁸ (firewall rules) and the Availability Zone. You can 
start an instance by using the Amazon EC2 API or the web 
console. By default an Amazon instance is started with some 
defined ephemeral storage space. Any data on it will be lost 
if you stop the instance. To use permanent storage you need 
to use a solution like EBS. When stopping a server you lose 
the attached public and private IPs. A new instance will have 
different IPs. The only way to keep a public static IP is to 
use Amazon EIP⁹. 


In September 2010, Amazon introduced some important 
features: Tagging, Filtering, Import Key Pair, and 
Idempotency. By adding customized tags (like hostname or 
profile name) you can easily filter your instances or EBS¹⁰ 
volumes based on the given tags. In short, tagging and 
filtering let you manage your own meta-information for 
each Amazon cloud resource. 


II. KEEP SOME ORDER IN YOUR CLOUD 


There are many client bindings built for the Amazon 
EC2 API which make it quite easy to use and implement. We 
started to use EC2 in 2008 by taking advantage of the 
computing ability that Amazon provides. We started a few 
dozen servers for a few hours a day to fetch and aggregate 
data from different partners. The aggregated data was pushed 
into our shared MySQL cluster at our Colo center. 


[Figure: Internet, Amazon EC2 API Servers, EC2 Instances, and 
the Application Server, with numbered steps 1 to 7] 

Fig 2. EC2 and Colo center 


In Figure 2, you can see how we interact with EC2 to 
crawl our partners’ APIs and store data in our database: 1) our 
application server calls the Amazon API at defined intervals 
to start Amazon instances; 2) Amazon launches the instances 
we requested; 3) we push our code to the EC2 instances and 
start our program; 4) our application opens an SSH tunnel to 
our databases; 5) we crawl our partners’ APIs and aggregate 
the data as we want; 6) we write the results to our databases; 
7) the EC2 instances kill themselves when they are done 
crawling. 

This design works great and requires really low 
maintenance. However, when you work in a startup 
environment, products evolve quickly. We needed to quickly 
develop our new video analytics product with a large number 
of servers to handle the analytics for billions of video streams 
per month. We chose to build this new product entirely on 
EC2. This let us change the application quickly while the 
product grew, without worrying about adding servers, racks, 
wiring, etc. Because of the nature of our product, we needed 
permanent storage, which is why we started to use EBS 
volumes. 

To be able to add or remove nodes easily with different 
instance profiles, it’s important to be able to quickly identify 
what a server is doing and what its role is (Web 
server, Database, Hadoop namenode/datanode, etc.). To keep 
some order in our cloud we used clear security groups, human 
readable hostnames (no ip-XXX.compute.internal or domU- 
XXX.compute.internal), NFS home directories and strong 
and flexible monitoring. 


[Figure: Engineer, DNS Server, VPN Server (192.168.0.0/24), 
EC2 private network (10.0.0.0/8), LDAP Server, NFS Server] 

Fig 3. EC2 and our private network 


A. Controlling access to the servers 


1) Amazon EC2 Security Groups can get a bit 
cumbersome to manage, especially when you want to access 
servers from anywhere without updating your rules while 
keeping a strong security policy. It’s easy to forget to update 
or remove an old IP, etc. This is why we chose to manage 
our servers by setting up OpenVPN¹¹ servers on two of our 
Amazon instances using static IPs, aka EIPs. The ingress rules 
for our Security Groups stay simple by allowing SSH only 
from those VPN servers and by opening only the required 
public ports, if any. The VPN (using OpenVPN with the 
auth-ldap¹² plugin) adds another layer of security, ensuring that 
only people with a valid username and password and a valid 
unique certificate can get access. 

2) In addition to firewalls, we needed to give restricted 
access to some DBAs, developers or contractors. Some needed 
root access. Our rule of thumb: “You only get the permissions 
you really need”. No need to give your boss root access to 
every server if he doesn’t even know what to do with it. To 
manage those permissions and user accounts we used 
OpenLDAP¹³. All our instances are configured with 
pam_ldap. We extensively use pam_filter to grant access 
based on hostname, host group and Availability Zone. 





pam_filter |(host=dev-mysql01.us-east-1b)(host=dev-mysql01.us-east-1)(host=dev-mysql01.\*)(host=dev-mysql\*.us-east-1b)(host=dev-mysql\*.us-east-1)(host=dev-mysql\*.\*)(host=\*.us-east-1b)(host=\*.us-east-1)(host=\*) 











At any time we can grant or revoke access for any user to a 
server or multiple servers in one or multiple regions. 
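
The filter shown for dev-mysql01 follows a regular pattern: the exact host, the host group, and the wildcard, each scoped to the Availability Zone, the Region, and everything. The Python sketch below generates a filter of that shape from a hostname and a zone. It is our reading of the example, not the script used at the time; the function names are hypothetical, and deriving the group by stripping trailing digits and the region by stripping the zone letter are assumptions.

```python
import re

STAR = r"\*"  # literal asterisk, escaped as in the example filter


def host_patterns(hostname, zone):
    """Return host patterns from most specific to least specific."""
    region = zone.rstrip("abcdefghijklmnopqrstuvwxyz")  # us-east-1b -> us-east-1
    group = re.sub(r"\d+$", "", hostname) + STAR        # dev-mysql01 -> dev-mysql\*
    patterns = []
    for name in (hostname, group, STAR):
        for scope in (zone, region, STAR):
            if name == STAR and scope == STAR:
                patterns.append(STAR)  # the catch-all entry
            else:
                patterns.append(f"{name}.{scope}")
    return patterns


def build_pam_filter(hostname, zone):
    """Build the OR-ed LDAP filter body for a pam_filter line."""
    return "|" + "".join(f"(host={p})" for p in host_patterns(hostname, zone))


if __name__ == "__main__":
    print("pam_filter " + build_pam_filter("dev-mysql01", "us-east-1b"))
```

A user whose LDAP entry carries any one of these host values would match, which is what lets a single attribute grant access to one server, one group, or one region.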


B. Identify running instances 


Having obscure hostnames doesn’t make your life easy 
when you start to deal with multiple instance profiles and 
multiple products with an extra-small sysop team (one or 
two people). When a product is in its early days with 
frequent changes, developers often need access to the 
servers to be able to troubleshoot issues and find out why 
their last release isn’t working as expected. To help 
identify our hosts we used one of our EC2 instances as a 
management server configured with a DNS service (Bind¹⁴) 
patched for the LDAP backend¹⁵ and an LDAP service 
(OpenLDAP 2.4) using some of our own LDAP schema. For 
each host we stored in LDAP the private IP (10.0.0.0/8) and 
the public IP (it can be an EIP). Each host that we started 
used an AMI configured with the given private IP of the 
name server. Our resolv.conf would look like this: 





domain <product>.private 
search <product>.private <product>.public 
nameserver 10.X.X.X 











When starting an instance we also used the user-data to 
update /etc/hostname. The user-data is an optional 
parameter you can use when starting an EC2 instance. It 
can hold up to 16KB of data. On the server you can fetch 
the user-data at boot through an init script running a curl 
command: 


curl -s http://169.254.169.254/latest/user-data 

From there, a lot becomes possible. In our case, we initially 
used the user-data just to pass our server hostname, for example: 
“hostname=dev-mysql01”. Note that, in the same way, you 
can access much of the meta-data of your running instance: 


curl -s http://169.254.169.254/latest/meta-data/ 
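
As an illustration of how an init script might consume user-data such as “hostname=dev-mysql01”, here is a minimal Python sketch. The original init script was a shell script using curl; this version is an assumption-laden equivalent. The URL is the one shown above, while `parse_user_data`, `fetch_user_data`, and the handling of blank or comment lines are hypothetical.

```python
from urllib.request import urlopen

USER_DATA_URL = "http://169.254.169.254/latest/user-data"


def parse_user_data(text):
    """Parse key=value lines into a dict.

    Blank lines, lines starting with '#', and lines without '=' are
    skipped (an assumption; the original format had a single pair).
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def fetch_user_data(url=USER_DATA_URL, timeout=2):
    """Fetch the raw user-data; only works from inside an EC2 instance."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Parsing a sample string keeps the example runnable anywhere.
    print(parse_user_data("hostname=dev-mysql01\n"))
```

On an instance, `parse_user_data(fetch_user_data())["hostname"]` would yield the name to write into /etc/hostname.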
The pam_ldap module was configured to use the DNS entry to get 
the LDAP server IP. 


uri ldaps://ldap.<product>.private 


We started instances using a Java command line tool called 
ec2ldap. We wrote it using Typica¹⁶ (Java binding for the 
Amazon API), SQLite¹⁷ and LDAP. We kept track of all 
our instance names and profiles in a SQLite database and 
used a script called Cerveza, written in Tcl/Tk, to access our 
hosts easily and do large maintenance with some one-liners: 


/cerveza remote mysql[1-40] service mysql restart 
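
The bracketed range in the one-liner suggests a host-range expansion step. The sketch below shows one way such an expansion could work in Python; it is not Cerveza's code (which was Tcl/Tk), and both the function name and the zero-padding width of two digits are assumptions based on hostnames like dev-mysql01.

```python
import re

# Matches a single numeric range such as "mysql[1-40]".
_RANGE = re.compile(r"^(?P<prefix>.*)\[(?P<start>\d+)-(?P<end>\d+)\](?P<suffix>.*)$")


def expand_hosts(pattern, width=2):
    """Expand 'mysql[1-40]' into ['mysql01', ..., 'mysql40'].

    A pattern without a bracketed range is returned unchanged as a
    one-element list.
    """
    match = _RANGE.match(pattern)
    if match is None:
        return [pattern]
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"empty range in {pattern!r}")
    return [
        f"{match['prefix']}{number:0{width}d}{match['suffix']}"
        for number in range(start, end + 1)
    ]


if __name__ == "__main__":
    hosts = expand_hosts("mysql[1-40]")
    print(hosts[0], hosts[-1], len(hosts))
```

Each expanded name can then be handed to whatever transport runs the command on the remote host.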


With the SQLite database and Cerveza, it was easy for 
us to run over all our EC2 instances and update the 
resolv.conf if our management box went down and got a new 
IP. This worked well for a while but there were some 
important single points of failure¹⁸ (SPOF) that finally bit us. 


C. The benefit of NFS auto-mounted home directory 


As stated earlier, developers needed easy access to the 
servers. To make their life easier we set up an NFS export 
on our management box and used Autofs to mount the home 
directories on all our EC2 instances. 





/etc/auto.master: 
/home /etc/auto.home intr,soft 


/etc/auto.home: 
* fstype=nolock,noatime,soft,intr nfs.<zone>.private:/home/& 











This setup makes it easy to run a script across multiple 
instances without copying the script to each host. It has 
been a great help in our dev environment but also when 
troubleshooting many servers in production. It’s convenient, 
because you get your bash aliases or user scripts everywhere 
you log in, etc. Unfortunately there is a downside: file 
access can get slow, and a home dir can get stuck or 
permanently mounted if a service writes to the home 
directory or keeps a file descriptor open, etc. 

In many cases we ended up using those auto-mounted 
home directories to run shared scripts on the first boot of an 
instance to deploy code, build our RAID devices with 
multiple EBS volumes or reassemble them using mdadm or LVM. 


D. Instance monitoring with Ganglia¹⁹ and Nagios²⁰ 


We chose to monitor our infrastructure with Nagios 
and Ganglia. It was a no-brainer for Nagios as we already 
used it to monitor our Colo servers and were quite used to its 
configuration. Ganglia was new for us as we used to graph 
our servers with Munin²¹. In our case, the decision between 
Munin and Ganglia was made on the poll versus push model. 
The Munin server polls each client, which requires many 
resources on the main server, especially when building the graphs. 
Ganglia uses a push model: each client reports to the main 
process (gmond). Ganglia allows much more flexibility in 
graphing grids and clusters, although we couldn’t use the 
multicast support. For security purposes, Amazon EC2 
doesn’t let you do multicast (or broadcast) on their 
network. 


We configured multiple gmond processes on our
management box to listen on different ports and collect data
in different cluster groups (one per Amazon Security Group),
then just one gmetad process to collect all the data from each
local gmond. This helped us organize our graphs. Our
EC2 instances were configured at first boot by
a Ganglia configuration script that ensures each
instance reports to the correct gmond process (if the instance
is in SG dev, it reports to port 8630; if in SG mysql, to port
8631; etc.). Ganglia is a powerful solution: we were able to use
the Python module to graph our Java processes using JMX
with JPype. All this data is grouped in different
dashboards and gives us a quick way to spot issues.
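
The first-boot mapping from security group to gmond port can be illustrated with a short Python sketch. This is not the configuration script itself: the function names, the placeholder collector host, and the fallback to gmond's stock port 8649 are assumptions made for illustration; only the dev and mysql port numbers come from the text.

```python
# Hypothetical sketch: pick the gmond port from the EC2 security group and
# render the matching unicast udp_send_channel stanza for gmond.conf.

# Ports for "dev" and "mysql" come from the text; the default is assumed.
SG_TO_PORT = {"dev": 8630, "mysql": 8631}
DEFAULT_PORT = 8649  # gmond's stock port, used here as an assumed fallback


def gmond_port(security_group):
    """Return the collector port for a security group."""
    return SG_TO_PORT.get(security_group, DEFAULT_PORT)


def render_send_channel(security_group, collector_host):
    """Render a unicast send channel (multicast is unavailable on EC2)."""
    return (
        "udp_send_channel {\n"
        f"  host = {collector_host}\n"
        f"  port = {gmond_port(security_group)}\n"
        "}\n"
    )


if __name__ == "__main__":
    # "mgmt.example.private" is a placeholder host name.
    print(render_send_channel("mysql", "mgmt.example.private"))
```

Keeping the mapping in one table means a new security group only needs one new entry and one new gmond listener on the management box.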


For our monitoring we use Nagios 3.2 with NSCA and
regex matching (in nagios.cfg: use_regexp_matching=1). We
defined some generic service definitions for each cluster of
servers. Some of our checks looked directly at the RRD data
generated by Ganglia. Because of the quickly growing
number of servers and services monitored, we started to
have too much I/O (reading and writing RRD files). We started
to use rrdcached, which solved most of the problem, but we
still had many Nagios active checks, which occasionally led to
swapping or slowness during checks. To fix the problem we
simply split our Ganglia load between two different
management boxes; both servers use rrdcached to reduce
I/O.


III. LEARNING THE HARD WAY
(or how to lock yourself out of your servers...) 


While we were building our infrastructure and
upgrading our network configuration, we were aware of a few
SPOFs being introduced, but they had low or no
impact on our production environment. However, what was
initially designed for convenience and laziness became
critical, and the way we started to depend on those services
made them even more critical. We didn’t see it coming.
This is the story of a three-day nightmare that started
with a VPN outage, followed by an NFS/LDAP outage that
locked us out of all our EC2 instances.


A. The outage 


1) For some reason, the file system storing our Nagios and
Ganglia files was corrupted (EBS or RAID problem). This
led to many processes getting stuck trying to access the
faulty device. Too many resources were being used, so the
OOM Killer started killing processes, including our VPN
process. After many reboots of the management server,
nothing came back up. The console output showed a
prompt for an fsck check due to the faulty device. We had to
kill the instance and start a new one.


2) The new instance failed to start. It prompted us again for
fsck on our EBS volumes (used for NFS home dirs). In
fact, the mount point was defined in the fstab in the AMI,
so it kept trying to mount the failing EBS volume with no way
for us to fix it. There is no KVM with EC2, so we didn’t have
any way to recover from this situation. We ended up
starting a new instance with an old AMI from which we
had removed the fstab, so we could start the instance and
finish the setup manually by running fsck, etc.


3) After reboot, our instance got a new private IP allocated.
This meant a new IP for our DNS, LDAP Producer and
NFS. After recovering our instance we reimported our last
ldif backup into LDAP. As the DNS server IP was
hardcoded on our instances, we had to “manually” log in to
each server using a local account with the ssh keypair,
then update resolv.conf, dnsmasq.conf and dhclient.conf,
and restart autofs and dhclient.


4) Unfortunately, as we used an old AMI for our
management box, we lost many configuration settings,
breaking our Nagios and Ganglia services and also our
command line tool (Cerveza), used to query our SQLite
DB and easily access any host. This slowed our ability to
recover a basic setup from which we could see what was
wrong and fix it.


5) The ssh backdoor didn’t always work. We had to restart
many instances manually. At boot they couldn’t load our
boot scripts from NFS. We had to log in and finish the
boot process manually by fixing autofs and then running the
boot scripts. We also had to reconfigure many ssh tunnels, fix
mysql replication, and recover missing or outdated
configuration files, etc.


6) Some of the servers were referenced by private IP in the
EC2 Security Groups; rebooting those servers made the outage
more complex, as we needed to review all our security
rules.


Luckily, this outage didn’t affect our production services, but
it did lock us out of our servers for a long time. Needless to
say, we took some time to revisit what went wrong and how
we could fix it.


B. What we quickly fixed 


1) One of the biggest pains during this outage was our pam
ldap and ssh configuration. Long timeouts prevented
us from logging in to many servers (the cumulative timeouts
were higher than our ssh LoginGraceTime, set to
2 min.), so the first thing we did was reduce the autofs and
ldap timeouts and change nsswitch to look at local
accounts before ldap. This way, even if our DNS and LDAP go
down, we still have an ssh backdoor to log in and do local
fixes or maintenance.

/etc/auto.master:
/home /etc/auto.home timeout=5,retry=0,rw,intr,soft

/etc/nsswitch.conf:
passwd: files ldap
shadow: files ldap
group: files ldap

/etc/ldap.conf:
timelimit 15
bind_timelimit 5











2) We fixed our resolv.conf to handle failover better using:
options attempts:1 timeout:1

3) We set up better service and DNS caching on each host
using nscd instead of dnsmasq. We enabled caching for
group, passwd, hosts and services.

4) We configured a secondary VPN service on our second
management server and configured the OpenVPN clients
to use the “remote-random” option.

5) We stopped saving our fstab in the AMI so we could boot
our instance even when an fsck is required.

6) We stopped using private IPs in our EC2 security groups.

7) We use an HAProxy load balancer for DNS and LDAP
service via public IP using EIP.

8) We set up better version control of our boot scripts and
AMI. We now manage almost everything with our configuration
management tool.

Fig 4. Network Flow between Clusters and Grid


IV. GOING WORLDWIDE


As our business evolved, we needed a presence in
different parts of the world. This is easy to do
with Amazon’s multiple regions, though we have response
time constraints with many partners. Our ninety-ninth
percentile response time must be under 120 ms, including the
network round trip. Our partners are within 60 ms of our
Amazon servers, so this doesn’t leave us much room,
especially if you consider the network variation inside
Amazon’s network or a noisy neighbor.

While building our international clusters, we tried to
keep two goals in mind. First, reuse our existing tools
and automate as much as we can. Second, do not create new
SPOFs whose failure in one region would impact the others.


A. Simplify the instance boot process 


With over 500 EC2 instances spread across multiple regions,
we had to make our life easier. We got rid of our Java tool
“ec2ldap” and rewrote Cerveza in Python using
Boto (Amazon API binding for Python). We rewrote
Cerveza to handle full instance start/stop/reboot with profile
management. We chose Python over Java because of the
scripting nature of Python. We didn’t want to slow ourselves
down with a compile/release process for this simple tool. A
scripting language lets us add features and fix bugs
quickly.


Our previous outage led us to stop using SQLite. We
wanted a solution where we do not have to rely on a local
database or be forced to start/stop instances from a
management server. We replaced SQLite with Amazon
SimpleDB, storing only profile information. For the rest
we leverage the tagging feature of the Amazon API. All our
hosts and EBS volumes are tagged with hostname, device
name, etc. This gives us much more flexibility, as we can run
Cerveza from our own laptops. We no longer depend on the
location of our SQLite database; we can start, stop, or reboot
instances of any kind from anywhere. The other major thing
we got rid of is the homemade AMI. It takes a lot of time to
build and maintain an AMI, so it’s not practical for deploying
changes. We chose to move to the official Ubuntu EC2 AMI
and use cloud-init. This is powerful: cloud-init allows us to
kick off our instances with different profiles by passing
advanced user-data or scripts.


When starting a host with Cerveza for the first time we need
to specify the instance profile we want to start (Hadoop
node, MySQL, Java server, etc.):


cerveza -m noc -- --zone ap-southeast-1a --start demo01 --profile UbuntuGeneric32Bit


To stop the host:


cerveza -m noc -- --zone ap-southeast-1a --stop demo01


To start the host a second time, we don’t need to define the
profile again; cerveza knows it by querying SimpleDB:


cerveza -m noc -- --zone ap-southeast-1a --start demo01


Besides using LDAP for DNS data and SimpleDB for
profile information of existing hosts, Cerveza also uses
YAML to define our instance profiles and volume profiles.


--- !InstanceProfile
name: UbuntuGeneric32Bit
desc: Ubuntu Generic instance profile without EBS Volumes
aws: !InstanceAws
ami: { us-east-1: ami-a6f504cf, us-west-1: ami-957e2ed0, ap-southeast-1: ami-7c423c2e, ap-northeast-1: ami-3a0fa43b, eu-west-1: ami-339ca947 }
security_group: devzone
key_pair: tm-devzone
type: c1.medium
elastic_ip: false
volumes: [ ]
startup_scripts: [ ]
shutdown_scripts: [ shutdown ]
user_data: [ cloud-config-base.txt, setup-hostname.sh, root-login.sh, cloud-config-puppet.txt ]
check_ec2_kernel: 2.6.35-28-virtual








Our Ubuntu Generic 32 Bit instance is generally used for
development purposes. In this profile we just define some
basic information (instance type, key pair, default SG, AMI,
etc.) but also important user-data. Given a list of files,
Cerveza automatically concatenates them to
generate a compressed MIME multipart data file and passes it
as the user-data when launching the instance. Cloud-init
reads it and executes each script when the server boots.
Cloud-init allows advanced configuration and many
possibilities. In our case, the user-data script
cloud-config-puppet.txt lets us configure Puppet, our
configuration management tool, at boot time.
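
The user-data assembly step can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch rather than Cerveza's code: the function names and the sample file contents are assumptions, and the part type is guessed from the first line of each file (`#cloud-config` or a `#!` shebang), which are the markers cloud-init recognizes for these two formats.

```python
# Hypothetical sketch: build gzip-compressed MIME multipart user-data.
import gzip
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def subtype_for(name, content):
    """Guess the cloud-init MIME subtype from the first line of a part."""
    if content.startswith("#cloud-config"):
        return "cloud-config"
    if content.startswith("#!"):
        return "x-shellscript"
    raise ValueError(f"cannot determine cloud-init type of {name}")


def build_user_data(parts):
    """parts: list of (filename, text) pairs, kept in the given order."""
    message = MIMEMultipart()
    for name, content in parts:
        part = MIMEText(content, subtype_for(name, content), "utf-8")
        part.add_header("Content-Disposition", "attachment", filename=name)
        message.attach(part)
    # Compress the whole archive to stay within the user-data size limit.
    return gzip.compress(message.as_bytes())


if __name__ == "__main__":
    # File contents below are placeholders, not the real profile files.
    blob = build_user_data([
        ("cloud-config-base.txt", "#cloud-config\npackage_update: true\n"),
        ("setup-hostname.sh", "#!/bin/sh\nhostname demo01\n"),
    ])
    print(len(blob), "bytes of compressed user-data")
```

The resulting bytes would be passed unchanged as the instance user-data at launch time.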


B. Use a configuration management tool 


We had been thinking about using a configuration
management tool for a long time, but hesitated until LISA
’10. As we changed our AMI and started to use cloud-init, we
took the opportunity to deploy Puppet on all our hosts and
start using it. We briefly looked at Cfengine and Chef
too, but finally decided to go with Puppet as it seemed a
little better documented and was already fully integrated
with cloud-init.


Configuring and deploying Puppet is fast and easy, but
using it properly is not that obvious. We had to deal with a
couple of annoying problems, like huge CPU spikes on each
client, obscure errors for the uninitiated, processes not
running because of a lock file left after reboot, etc. We
addressed most of those issues. We found out that overusing
Augeas is not necessarily good. We were able to speed up our
puppet runs from over 400 seconds to less than 15 seconds by
replacing Augeas with puppet templates (mostly for long sysctl
configurations). We use some Ruby environment variables to
optimize each puppet client run, though we are still
experimenting with those. We stopped running puppet as a
daemon, as “fileserver” used too many resources. We had
cases where puppet was using over 1GB of RAM, leading the
OOM Killer to kill other processes such as our Membase server.
We now run puppet from a crontab every half an
hour. To avoid a peak of requests on our puppet master, we
run the cron at random minutes on each client.





# schedule puppet to run via cron
$minute1 = generate('/usr/bin/env', 'sh', '-c', 'printf $((RANDOM%29+0))')

cron {
  "puppet run":
    ensure      => present,
    command     => "/usr/sbin/puppetd --onetime --no-daemonize --logdest syslog > /dev/null 2>&1",
    environment => [ 'RUBY_HEAP_MIN_SLOTS=500000',
                     'RUBY_HEAP_SLOTS_INCREMENT=250000',
                     'RUBY_HEAP_SLOTS_GROWTH_FACTOR=1',
                     'RUBY_GC_MALLOC_LIMIT=500000'
                   ],
    user        => "root",
    minute      => $minute1,
    hour        => "*";
}
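
As an illustration of the same spreading idea outside Puppet, the Python sketch below derives the run minutes from a hash of the host name instead of a random draw, so each host keeps a stable slot between runs. This swaps in a different technique from the manifest above; the function name, the use of MD5, and the example host names are assumptions for illustration only.

```python
# Hypothetical sketch: stable per-host cron minutes derived from a hash.
import hashlib


def puppet_minutes(fqdn, interval=30):
    """Return the minutes of the hour at which a host should run puppet."""
    digest = hashlib.md5(fqdn.encode("utf-8")).hexdigest()
    first = int(digest, 16) % interval       # stable offset in [0, interval)
    return list(range(first, 60, interval))  # two slots when interval is 30


if __name__ == "__main__":
    for host in ("demo01.example.private", "demo02.example.private"):
        print(host, puppet_minutes(host))
```

Because the offset depends only on the host name, the crontab entry does not change from one run to the next, while different hosts still land on different minutes.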











In the end, Puppet makes it easier to manage and
change configuration on multiple servers in four different
data centers. Our puppet masters are located in our Colo
center on the US east coast. They are set up with Apache 2 +
Phusion Passenger, with one master and one failover server.
The failover server also handles the puppet reports using
Puppet Dashboard. We patched the puppet clients to report
their FQDN as hostname instead of using their certificate
name.


We currently don’t have a clear dev environment for our
puppet configuration, though our dev servers are set up to use
a different environment so we can test our module changes
in dev before pushing to production. We are looking at better
ways to manage this.





in puppet.pp:

class puppet inherits puppet::init {
  if $hostname =~ /^dev-*$/ or $ec2_security_groups == "devzone" {
    augeas {
      "puppet env":
        context => "/files/etc/puppet/puppet.conf/main",
        onlyif  => "get environment != 'development'",
        changes => "set environment 'development'",
        notify  => Exec["puppet"];
    }
  }
}

in puppet.conf:
[development]
manifestdir = $confdir/dev/manifests
manifest = $manifestdir/site.pp
modulepath = $confdir/dev/modules:$confdir/modules

Fig 5. Network Flow between multiple AWS regions


C. Mirroring DNS, LDAP, NFS 


Because of the multi-region setup and our response time
constraint, we had to have DNS servers in each region. We
use some “gateway” servers whose role is to serve as local
DNS, LDAP and NFS servers. As our DNS depends on LDAP,
we initially set up an LDAP proxy with query caching, which
worked great except when running a non-cached query.
We were getting latency spikes of up to four seconds
for a DNS response. This affected our production
response time, in some cases increasing our percentage of
timed-out requests. We changed this configuration to use
LDAP syncrepl. Each LDAP server in each region is a
master replicating from one of our master servers in US East.
This solved our DNS and pam ldap response times.
However, since we use autofs for our home directories,
we had to address the same problem for our NFS server. In
each region we use an NFSv4 mount with FS-Cache
(cachefilesd), which aims to improve read speed in each
region. The key thing we did was to remove the NFS mount
point from the updatedb configuration, because indexing it
would generally kill the server performance.





/etc/updatedb.conf:

PRUNE_BIND_MOUNTS="yes"
PRUNEPATHS="/tmp /var/spool /media /opt/openldap/var /EBS /home"
PRUNEFS="NFS nfs nfs4 rpc_pipefs afs binfmt_misc proc smbfs autofs iso9660 ncpfs coda devpts ftpfs devfs mfs shfs sysfs cifs lustre_lite tmpfs usbfs udf fuse.glusterfs fuse.sshfs ecryptfs fusesmb devtmpfs bindfs"











We are still not fully satisfied with our current solution and
may stop using NFS for our home directories, as it introduces
a possible snowball effect if our NFS fails in US East.
Auto-mounted home directories no longer add
value as the product matures and our server
infrastructure grows. Also, we have more and more clients
using NFS and doing multiple mounts/unmounts, which
frequently leaves home directories stuck with a “Stale NFS
file handle”.


D. What else? 


To speed up our application deployment in multiple
regions we started to use Amazon S3 with localized
buckets. Instead of pushing our files from our Colo to each
server, we push the files once to each of the localized S3
buckets; each server then fetches the release files from S3
and deploys them locally.


Overall, with this infrastructure, we still have room for
many improvements:


1) One clear blocker is NFS. We definitely plan to entirely
remove NFS with auto-mounted home directories and get
back to a more standard way of managing our servers. We
are introducing more security checks and rules limiting
production access, so there should no longer be any need for
user home directories to be synchronized this way on all
our servers.


2) We currently have two different sets of VPN and LDAP
servers, one in our Colo and one in EC2. We want to
centralize them to simplify our user and ACL
management.


3) We still have some “Gateway” servers acting as bridges
between our regions. They are not based on the Ubuntu
EC2 AMI. For lower maintenance on our side, we want to
migrate everything onto the official Ubuntu EC2 AMI and
make full use of cloud-init. We also want to get to a
more standardized approach to managing our setup by
using our internal Debian repository when required.


4) We are looking at Amazon VPC to be able to better
manage our private IPs and clusters. It can help to have
better security policies in place, preventing your backend
from being accessed from the public Internet, etc.


5) We plan to look again at Amazon ELB to manage our
load balancing. One of the biggest drawbacks we
had with ELB was the lack of visibility. No access logs
and no clear error reporting make things hard to
troubleshoot, especially when you start having 500 errors
returned by ELB during traffic spikes.


V. LESSONS LEARNED


The evolution of your infrastructure must stay fault-tolerant
in every case. What was simple and working at first can get
complex in a multi-region, high-latency environment.


In a small team with limited resources you will have
little time to get everything right. You will miss important
points, leading to outages. Make sure to have a valid backup
strategy and a recovery procedure.


Never build a SPOF, even if it’s for a “non-critical” use.
As you start to rely more on these services (and you generally
don’t see it coming), your SPOF can have more impact than
you would anticipate.


Legacy infrastructure can become a pain to maintain.
Don’t be afraid to revisit what you did and change it. What
was true at one point of your design may not be true
anymore.


Scaling your infrastructure in a fast-paced environment
requires a lot of automation, which is why using a
configuration management tool early will save you
many headaches later on.


Abstract


Protecting computer and information systems from security
attacks is becoming an increasingly important task
for system administrators. Honeypots are a technology
often used to detect attacks and collect information
about techniques and targets (e.g., services, ports, operating
systems) of attacks. However, managing a large
and complex network of honeypots becomes a challenge
given the amount of data collected as well as the risk that
the honeypots may become infected and start attacking
other machines. In this paper, we present DarkNOC, a
management and monitoring tool for complex honeynets
consisting of different types of honeypots as well as other
data collection devices. DarkNOC has been actively used
to manage a honeynet consisting of multiple subnets and
hundreds of IP addresses. This paper describes the
architecture and a number of case studies demonstrating the
use of DarkNOC.


1 Introduction 


Because of the value of the data they store and the
resources they provide, information systems become
targets for attackers and must be protected. To better
secure computer systems from external threats, security
researchers aim to understand attackers and the different
techniques they use to compromise computers and
achieve their goals. One possible approach is to use a
target computer, called a honeypot, which is not used
by normal users. Therefore, all the activity towards this
computer can be considered malicious.

Individual honeypots or networks of honeypots have
been used to conduct various studies of attackers [1,9]
and analysis of cyber crimes such as unsolicited electronic
mail, phishing [10], identity theft and denial of
service. The computer security community has used
honeypots to analyze different techniques deployed by the
attackers to reach their objectives. Attackers’ arsenal
includes distributed denial of service [24], botnets [2],
worms [11] or SPAM [15]. However, few studies focus
on the usage of honeypot data to help network
administrators better protect their production networks.
Honeypot deployment is challenging and the architecture of
such networks is complex. For example, distributed
honeynets require secure tunnels, and different levels of
protection must be in place to ensure total containment of
attacks targeting the honeypots. In addition, honeynets
require constant monitoring to guarantee that protection
systems (for example firewalls, traffic shapers) and data
collection are operating correctly. Depending on the size
of the honeynet, the volume of data collected can be
large and can significantly affect data processing and
extraction. To be integrated as a security tool, honeypot
data must be presented and translated in a meaningful way
to network administrators.


In this paper, we introduce DarkNOC, a solution
designed to efficiently process large amounts of malicious
traffic received by a large honeynet, and to provide a
user-friendly Web interface to highlight potentially
compromised hosts to security administrators, as well as to
provide the overall network security status. DarkNOC is
used to manage the UMD honeynet, a network of 2,000
honeypots from which information about attacks is
continuously extracted and provided to the security team to
help them better protect the production network.

The rest of the paper is organized as follows. In
Section 2, we provide an overview of the architecture and
operation of DarkNOC. In Section 3, we describe the
outputs and views provided by DarkNOC. We
provide a number of case studies using DarkNOC in Section
4. Finally, we review the related work in Section 5,
provide some remarks on future work in Section 6, and
conclude the paper in Section 7.


2 DarkNOC Architecture 


This section describes what DarkNOC does, how it
collects data, and its internal structure.


2.1 System Architecture 


DarkNOC manages multiple types of honeypots and
information sources as illustrated in Figure 1. The UMD
honeynet consists of low interaction honeypots (LIHs)
such as Nepenthes [3] as well as high-interaction
honeypots (HIHs) consisting of virtual or physical machines
running real operating systems, applications, and
services [5]. The UMD honeynet supports multiple
subnets consisting of IP addresses contributed by different
organizations participating in the research. DarkNOC
collects multiple sources of information from different
devices (e.g., NetFlow from Gateway, Snort events from
Snort Sensors [20], and malware from Nepenthes),
analyzes the data, and presents it to users in an efficient
and actionable manner. The details of the data views and
their use in analyzing security incidents are discussed in
Sections 3 and 4.

The current information sources consist of the following:


• NetFlow Data: DarkNOC uses nfdump¹ to extract
NetFlow data collected on the main gateway of the
honeypots. The flow data provides enough information
to determine the number of attackers, the
different source and destination IP addresses, and
the different source and destination ports. Specifically,
each NetFlow record summarizes communication
between two network end points (defined by
the IP addresses and port numbers of the end points)
including the time, duration, and numbers of bytes
and packets (see example below), but does not
contain any payload information (i.e., content of the
messages transmitted).


Date flow start       Duration     Proto  Src IP:Port          ->  Dst IP:Port  Packets  Bytes  Flows
2010-02-09 06:43:...  4294966.937  TCP    218.8.251.187:20347  ->  x.x.x.x:80   2        94     1
2010-02-09 06:43:...  4294966.977  TCP    218.8.251.187:20347  ->  x.x.x.x:80   2        94     1


¹ http://nfdump.sourceforge.net/

• Snort Events: Snort [20] is an Intrusion Detection
System (IDS) for detecting attacks and potential
intrusions. Snort provides information about the types
of attacks used against the honeypots.


• Malware Collection: Nepenthes acts as a passive
malware collector by emulating common service
vulnerabilities and allowing attackers to inject the
malware binaries. Nepenthes provides a log of each
malware submission containing information such as
the date and the vulnerability used, as well as the
binary injected. This allows DarkNOC to see what
kinds of malware are successfully uploaded, the
security signatures, and the port used. It also makes it
possible to measure the efficiency of the security
solution protecting the network.
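
To make the NetFlow record format above concrete, the following Python sketch parses text records of that shape and derives the counts mentioned in the NetFlow bullet (flows, distinct sources, targeted ports). It is illustrative only and is not DarkNOC's Perl back-end: the regular expression, field names, and function names are assumptions, and every source address is treated as an attacker, which presumes the input contains only inbound flows.

```python
# Hypothetical sketch: parse nfdump-style text records and summarize them.
import re

FLOW_RE = re.compile(
    r"^(?P<date>\S+)\s+(?P<time>\S+)\s+(?P<duration>\S+)\s+(?P<proto>\w+)\s+"
    r"(?P<src_ip>\S+):(?P<src_port>\d+)\s+->\s+"
    r"(?P<dst_ip>\S+):(?P<dst_port>\d+)\s+"
    r"(?P<packets>\d+)\s+(?P<bytes>\d+)\s+(?P<flows>\d+)\s*$"
)


def parse_flow(line):
    """Parse one record; return a dict, or None for header/summary lines."""
    match = FLOW_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    for key in ("src_port", "dst_port", "packets", "bytes", "flows"):
        record[key] = int(record[key])
    return record


def summarize(lines):
    """Count records, distinct source addresses, and targeted ports."""
    records = [r for r in map(parse_flow, lines) if r is not None]
    attackers = {r["src_ip"] for r in records}
    ports = sorted({r["dst_port"] for r in records})
    return {"flows": len(records), "attackers": len(attackers),
            "dst_ports": ports}


if __name__ == "__main__":
    sample = [
        "2010-02-09 06:43:... 4294966.937 TCP 218.8.251.187:20347 -> x.x.x.x:80 2 94 1",
        "2010-02-09 06:43:... 4294966.977 TCP 218.8.251.187:20347 -> x.x.x.x:80 2 94 1",
    ]
    print(summarize(sample))
```

Applied to the two example records, the summary reports two flows from a single source address, both aimed at port 80.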


2.2 DarkNOC Software Architecture 


The design of the DarkNOC software architecture was 
driven by the following constraints: 


• The aesthetics from the user’s point of view: The
user interface should be easy to access and the
important data should be automatically highlighted.
This interface should be highly portable so that
users can use different operating systems and access
the system from different geographic locations (i.e.,
not tied to one dedicated machine).


• Speed: The user interface must be fast and the user
should not have to wait for the results to be
displayed. Processing high volumes of data can be
time consuming and if the processing is started only
when the user requests a data view, the response
time may not be satisfactory. Therefore, our
system uses data pre-processing when possible to
ensure fast response.


• Data validity: The data displayed should be
reasonably up to date and reflect the current activity.


To meet these requirements, the application software 
has been divided into three different parts: 1) a graphical 
Web front-end, 2) back-end, and 3) alerting module. The 
front-end generates a Web page displaying the different 
information. The back-end extracts the necessary data 
from the flows and creates the different graphs. 


Back-end Module: Written in Perl, the back-end module
is a background process that updates the information
displayed by the front-end every 5 minutes based on the
NetFlow data. The separation of flow processing from
the display was necessary to guarantee a fast response
time at the user interface, because the extraction of flow
data can be time consuming. Since the flow data is
updated every 5 minutes by the flow collector, a
continuous live update of the displayed views is unnecessary.
However, it requires the tool to process the new flow files
within 5 minutes. DarkNOC provides information for
the last 24 hours and the last 5 minutes. Two different
processes generate the 24-hour and 5-minute statistics.
For about 2,000 IP addresses, an average of 15,995 flows
are generated every 5 minutes, representing about 5
million flows per day. It takes an average of 7.4 seconds
to process a newly created flow file. Given this
number, DarkNOC is able to process almost a hundred times
more flows within 5 minutes. Generating the statistics on
the last 24 hours is computationally more expensive and
takes longer: an average of 130 seconds. However, it is
not necessary for this process to finish within 5 minutes.

Figure 1: System architecture


A lock file prevents multiple executions of this process at
the same time. For each subnet and the global view, the
back-end generates the different graphs, the list of
destination ports, the list of attackers and the list of targeted
honeypots. The graphs are created using RRDTool², an
open source tool for storage and retrieval of time series.
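
The back-end itself is written in Perl; purely as an illustration, a lock-file guard of the kind described above might look like this in Python. The function names and the lock path are hypothetical, and the sketch does not handle stale locks left behind by a crashed run.

```python
# Hypothetical sketch: prevent overlapping runs with an exclusive lock file.
import os


def acquire_lock(path):
    """Atomically create the lock file; return False if it already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owner for debugging
    return True


def release_lock(path):
    """Remove the lock file so the next run can start."""
    os.remove(path)


def run_exclusive(path, job):
    """Run job() unless another run holds the lock; report whether it ran."""
    if not acquire_lock(path):
        return False
    try:
        job()
    finally:
        release_lock(path)
    return True
```

A scheduler can call `run_exclusive` every 5 minutes: when the 24-hour statistics are still being generated, the new invocation simply returns without starting a second copy.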


² http://oss.oetiker.ch/rrdtool/

Figure 2: DarkNOC’s graphical user interface

Graphical User Interface: The graphical user interface
organizes the different data necessary to present a
summary of the honeypot activity. Web technologies
such as the PHP language and Cascading Style Sheets
are used. A Web page is extremely portable and requires
no configuration on the client side. Figure 2 shows the
homepage of DarkNOC. The content is described in
Section 3. The graphical user interface first provides a global
view of the activity of the honeypots: the data displayed
includes all the subnets. The user then has the possibility
to reduce the scope of analysis to one subnet. To
prevent unauthorized access, the application uses HTTP
authentication over SSL to protect DarkNOC’s directory
on the Web server. Apache is configured to authenticate
users against an LDAP server where all accounts are
centralized. User objects belonging to the group DarkNOC
have access to the application. For legal and
confidentiality reasons it is necessary to filter
the information displayed by DarkNOC. Once the user is
authenticated, DarkNOC retrieves the user name stored in the
$_SERVER['PHP_AUTH_USER'] variable and matches it
with the users table in the database to determine which
subnets to display. If the user is allowed to access
more than one subnet, DarkNOC will reflect the user’s
rights in the global view as well as in the subnet selector.
If the user has access to a single subnet, that subnet will
be automatically selected with no possibility to select
another one.
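
The per-user filtering rule can be illustrated with a small Python sketch. DarkNOC's front-end is written in PHP and reads permissions from its database; here the permissions table is a plain dictionary, and the function names, subnet labels, and returned structure are all assumptions made for illustration.

```python
# Hypothetical sketch: decide which subnets a user sees in the selector.


def visible_subnets(user, permissions, all_subnets):
    """Return the subnets a user may see, in display order."""
    allowed = permissions.get(user, set())
    return [s for s in all_subnets if s in allowed]


def selector_state(user, permissions, all_subnets):
    """Describe the subnet selector shown to a user."""
    subnets = visible_subnets(user, permissions, all_subnets)
    if len(subnets) == 1:
        # Single subnet: pre-selected, and the selector is locked.
        return {"subnets": subnets, "selected": subnets[0], "locked": True}
    # Several subnets: start on the global view restricted to them.
    return {"subnets": subnets, "selected": "global", "locked": False}


if __name__ == "__main__":
    perms = {"alice": {"net-a", "net-b"}, "bob": {"net-b"}}
    print(selector_state("bob", perms, ["net-a", "net-b", "net-c"]))
```

Computing the visible list once and using it for both the global view and the selector keeps the two consistent with the user's rights.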


Alerting Module: The alerting module is a process
executing a specific query on the flow data. The results are
sent by email to a specific group of users. Users have the
possibility to create their own flow query based on the
nfdump filter syntax and to specify the recipients of the
alerts. The module is currently launched twice a day: at
6:00 AM and at 6:00 PM. It can be executed more
frequently if more real-time alerts are required.


3 Display Description 


The layout of the graphical user interface of DarkNOC
presented in Figure 2 organizes the different pieces of
information gathered, from the most global and important
to the most detailed, concerning the current activity of the
honeypots. The user interface of DarkNOC has been
developed to ease the comparison of the different sources
of information and the comparison of the different
subnets.

The Web page provided by DarkNOC is divided into
three different sections: 1) status of the subnets, 2)
flow-based information, and 3) Snort events. Each section
provides information that reduces the number of
possible explanations when an anomaly in the traffic is
identified in DarkNOC. The first screen provided is a global
view of the honeypot activity. The user can select a
specific subnet to drill down to a more detailed view of the
subnet activity.


3.1 Subnet Status and Network Traffic 


The first part of the Web page shown in Figure 3 is
composed of a table giving the status of the low interaction
honeypots (LIH) running Nepenthes, the status of
the tunnels to different organizations, and the number of
malware samples collected for each subnet since the
initialization of DarkNOC. The notion of a tunnel is specific
to the UMD honeynet. It allows network traffic to be
redirected transparently from remote locations to the honeypot
network. Hence, it is possible to use other participating
organizations’ IP addresses. A graph representing the
incoming and outgoing traffic in bytes per second is
included in the status section as well. This section provides
essential indications on the state of the main components
of the UMD honeynet, i.e., tunnels and main gateway.
The graph gives an overview of the UMD honeynet
infrastructure load and can help to detect anomalies in the
traffic.

Figure 3: Subnets status section

Figure 4: Top and bottom 10 transport ports targeted


3.2 NetFlow Data 


The NetFlow section provides information extracted
from the NetFlow data collected at the edge of the honeypot
network. Figure 5 presents a graph showing the number of
attackers over time for each subnet of the honeypot
network. Each unique IP address that does not belong to the
honeypots is considered a unique attacker. The graphical
user interface provides several graphs that display the
number of attackers at different time scales: one day, one
week, and one month. Figure 6 presents a graph showing the
number of flows over time for each subnet of the honeypot
network. Separate graphs are used to display the number of
flows at different time scales.
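
The attacker definition above can be sketched as follows. This is an illustrative sketch only: the record layout (source and destination address per flow), the subnet names, and the function name are assumptions, not taken from DarkNOC.

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network


def count_attackers(flows, subnets):
    """Count unique non-honeypot source addresses per honeypot subnet.

    flows   -- iterable of (src_ip, dst_ip) strings (assumed record layout)
    subnets -- dict mapping a subnet name to its CIDR range
    """
    nets = {name: ip_network(cidr) for name, cidr in subnets.items()}
    attackers = defaultdict(set)
    for src, dst in flows:
        src_ip, dst_ip = ip_address(src), ip_address(dst)
        # A source inside any honeypot subnet is not counted as an attacker.
        if any(src_ip in net for net in nets.values()):
            continue
        for name, net in nets.items():
            if dst_ip in net:
                attackers[name].add(src_ip)
    return {name: len(attackers[name]) for name in nets}
```

Applying the same function to the flows of one day, one week, or one month gives the three time scales.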


These two graphs shown in Figures 5 and 6 make it 
easy to observe the activity of the honeypots for each 
subnet. Comparing the numbers of flows and attackers 
can reveal attack characteristics. For example, an increase
in the number of flows while the number of attackers
remains relatively steady means that one or several
offenders may have launched an attack that generates
large numbers of flows, such as port scanning or brute-force
activity. It can also mean that a large network behind a
network address translation system is compromised
and targeting the UMD honeynet. DarkNOC also
makes it easy to compare trends between the different
subnets. For example, it is straightforward to identify
peaks in the number of attackers or flows that occur at 
the same time in different subnets, as well as changes in 
the attacks directed to only one of the subnets, indicating 
a targeted attack. 
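
One way to make the comparison between flows and attackers concrete is to track the number of flows per attacker and flag intervals where it rises well above its typical value. The sketch below is an assumption-laden illustration: the median baseline and the factor of 3 are arbitrary choices, not part of DarkNOC, where the comparison is made visually.

```python
from statistics import median


def flows_per_attacker(flow_counts, attacker_counts):
    """Return the flows-per-attacker ratio for each time interval."""
    return [f / a if a else 0.0 for f, a in zip(flow_counts, attacker_counts)]


def flag_heavy_intervals(flow_counts, attacker_counts, factor=3.0):
    """Return indices of intervals whose ratio exceeds factor x the median."""
    ratios = flows_per_attacker(flow_counts, attacker_counts)
    if not ratios:
        return []
    baseline = median(ratios)
    return [i for i, r in enumerate(ratios) if baseline > 0 and r > factor * baseline]
```

A flagged interval corresponds to the situation described above: many more flows from roughly the same number of sources.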
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Figure 6: Number of Flows 


The tables in Figure 4 show the top and bottom 10
ports targeted by the attackers during the last 24 hours.
For each port, the number of flows and the percentage of
the total number of flows are provided. This makes it easy
to identify the most popular services and to protect the
network accordingly. However, the severity of an attack is
not related to the number of flows it generates, and attacks
on common ports tend to hide smaller attacks against less
popular ports. This is why we also display the bottom 10
ports targeted.
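
The two port tables can be derived from a single pass over the flow records. The following is a minimal sketch, assuming each flow is reduced to a hashable port key such as a (protocol, port) pair; the function name is hypothetical.

```python
from collections import Counter


def port_tables(flow_ports, n=10):
    """Return (top, bottom) lists of (port, flows, percent) tuples.

    flow_ports -- iterable with one port key per flow, e.g. ("TCP", 22)
    """
    counts = Counter(flow_ports)
    total = sum(counts.values())
    ranked = counts.most_common()  # most targeted first

    def rows(items):
        return [(port, c, round(100 * c / total, 2)) for port, c in items]

    top = rows(ranked[:n])
    bottom = rows(ranked[-n:][::-1])  # least targeted first
    return top, bottom
```

Listing the bottom rows in ascending order keeps the rarest ports at the head of the table, which is where the small attacks mentioned above would show up.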


Finally, Figure 7 represents a word cloud of the top 
20 attackers’ IP addresses. The top 20 IP addresses are 
determined using the number of flows involved in the 
communications between the attacker and the honeypots. 
The size of the font displaying the IP address reflects the 
number of flows generated for that IP address. The same 
representation is used for the top 20 targeted honeypots. 
These word clouds are updated every 5 minutes using 
a 24-hour window. The IP addresses presented in the 
word clouds are clickable: The user can obtain the lists 
of honeypots contacted, services and Snort events related 
to the selected IP address in a separate window. Since the 
honeypot network often hosts different experiments with 
different configurations, the port tables and the targeted 
honeypots make it possible to determine what is attract- 
ing the attackers the most. 
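
The text only states that the font size reflects the number of flows. As one possible reading, the sketch below scales the font size linearly between a minimum and a maximum over the top 20 addresses; the linear mapping, the point sizes, and the function name are assumptions.

```python
def cloud_font_sizes(flows_by_ip, top_n=20, min_pt=10, max_pt=36):
    """Map the top_n addresses by flow count to a font size in points."""
    top = sorted(flows_by_ip.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    if not top:
        return {}
    lo, hi = top[-1][1], top[0][1]
    span = hi - lo
    sizes = {}
    for ip, flows in top:
        # With a single value (span == 0) every address gets the largest size.
        scale = (flows - lo) / span if span else 1.0
        sizes[ip] = round(min_pt + scale * (max_pt - min_pt))
    return sizes
```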
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Source IP Addresses (last 24 hours) 


[Word cloud image: top 20 source IP addresses, sized by number of flows]


Figure 7: Attacker word cloud 


3.3 Snort Data


Last 10 Snort Events

2011-06-06 4:41:53   Stream5: Limit on number of overlapping TCP packets reached      174.77.190.64
2011-06-06 4:41:53   Stream5: Bad segment, overlap adjusted size less than/equal 0    174.77.190.64
2011-06-06 4:41:50   ICMP PING NMAP                                                   94.248.15.15
2011-06-06 4:41:4    Stream5: Limit on number of overlapping TCP packets reached      190.11.1
2011-06-06 4:41:45   Stream5: Bad segment, overlap adjusted size less than/equal 0    190.11.1
2011-06-06 4:41:4    Stream5: Limit on number of consecutive small segments reached   190.11.1
2011-06-06 4:41:44   Stream5: Limit on number of overlapping TCP packets reached      78.187.81
2011-06-06 4:41:44   Stream5: Bad segment, overlap adjusted size less than/equal 0    78.187.81.52
2011-06-06 4:41:44   Stream5: Limit on number of consecutive small segments reached   78.187.81.52
2011-06-06 4:41:4    Stream5: Limit on number of overlapping TCP packets reached      94.97.113.112

Figure 8: Last 10 Snort events table 


The Snort section presents information about the Snort 
alerts. 

Figure 8 shows a table of the last 10 Snort events col- 
lected on the honeypot network. This table allows honey- 
pot administrators to immediately identify attacks gener- 
ating high volumes of traffic. For example, a brute-force 
attack against a Microsoft SQL server will generate a 
spike in the traffic curves and the corresponding events 
will appear immediately in this table. 

The graph in Figure 9 shows the trend in the number
of Snort events recorded during the current day, the past
few days, and the past few weeks.

Figure 10 shows the top and bottom 10 Snort signature
tables. The tables provide the signature name, the number of
events for each signature, and the percentage of all events.
Large-scale attacks such as port scanning or brute-force
attacks may generate a large number of events. As a
consequence, smaller but still important attacks may not
appear in the top 10 signatures. This is why the bottom 10
Snort signatures are also provided. As an example, consider
the Snort signature SHELLCODE NOOP shown in the bottom 10
Snort events of Figure 10. This signature indicates attempts
to upload malicious shellcode.

Top 10 Snort Events (last 24 hours)

Signature                                                          Number   Percentage
snort: "SQL sa brute force failed login unicode attempt"           56932    54.55%
"MS-SQL SA brute force login attempt TDS v7/8"                     23851    22.85%
Stream5: Bad segment, overlap adjusted size less than/equal 0      8357     8.01%
Stream5: Limit on number of overlapping TCP packets reached        7805     7.48%
Stream5: Limit on number of consecutive small segments reached     2325     2.23%
MISC MS Terminal server request                                    1519     1.46%
ICMP PING NMAP                                                     1271     1.22%
ssh: Protocol mismatch                                             801      0.77%
Stream5: Data sent on stream not accepting data                    308      0.3%
Stream5: Packet missing timestamp                                  305      0.29%

Bottom 10 Snort Events (last 24 hours)

Signature                                                          Number   Percentage
snort: "SPECIFIC-THREATS ASN.1 constructed bit string"             1        0%
WEB-IIS WEBDAV nessus safe scan attempt                            7        0.01%
ICMP Source Quench                                                 7        0.01%
ICMP L3retriever Ping                                              15       0.01%
snort: "SHELLCODE base64 x86 NOOP"                                 19       0.02%
ICMP Destination Unreachable (Communication with Destination
  Host is Administratively Prohibited)                             25       0.02%
ftp_pp: Invalid FTP command                                        28       0.03%
"POLICY RDP attempted Administrator connection request"            29       0.03%
Stream5: TCP Timestamp is outside of PAWS window                   30       0.03%
ICMP Destination Unreachable (Communication Administratively
  Prohibited)                                                      34       0.03%

Figure 9: Snort events graph

Figure 10: Top and bottom 10 Snort signatures


In the following example, the Snort IDS alerts show a
possible injection of malicious code on an emulated Web
server:


04/15-06:49:15.474819 [**] [1:12799:3] SHELLCODE base64 x86 NOOP [**] 
[Classification: Executable Code was Detected]... {TCP} a.b.c.d:15017 -> W.X.Y.Z.:80 

04/15-06:49:15.474819 [**] [1:12802:3] SHELLCODE base64 x86 NOOP [**] 
[Classification: Executable Code was Detected]... {TCP} a.b.c.d:15017 -> W.X.Y.Z.:80 

04/15-06:49:15.619028 [**] [1:12800:3] SHELLCODE base64 x86 NOOP [**] 
[Classification: Executable Code was Detected]... {TCP} a.b.c.d:15017 -> W.X.Y.Z.:80 


The injection was successful and Nepenthes captured 
and logged the malware submission: 


[2011-04-15T06:49:19] a.b.c.d-> W.X.Y.Z. ftp://1:1@a.b.c.d:21/Rewetsr.exe 
c511c4f9bdd3bb892e582fbc9a00da9c
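
To correlate alerts like the ones above with a Nepenthes submission, the alert text first has to be split into fields. The sketch below parses the alert layout shown in this example with a regular expression; the pattern and the field names are assumptions derived from these sample lines only, and other Snort output modes will need a different pattern.

```python
import re

# Pattern written against the alert layout shown above (an assumption,
# not a general Snort log grammar).
ALERT_RE = re.compile(
    r"(?P<ts>\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d+)\s+\[\*\*\]\s+"
    r"\[(?P<gid>\d+):(?P<sid>\d+):(?P<rev>\d+)\]\s+(?P<msg>.*?)\s+\[\*\*\]"
    r".*?\{(?P<proto>\w+)\}\s+(?P<src>\S+)\s+->\s+(?P<dst>\S+)",
    re.DOTALL,
)


def parse_alert(text):
    """Return the alert fields as a dict, or None if the text does not match."""
    match = ALERT_RE.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("gid", "sid", "rev"):
        fields[key] = int(fields[key])
    return fields
```

Grouping the parsed alerts by their `src` field and comparing timestamps with the Nepenthes log is then a matter of ordinary dictionary handling.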


4 Case Study 


This section details the UMD honeynet, the honeypot 
network deployed at the University of Maryland and also 
describes how DarkNOC is used to operate and maintain 
this particular network. 


4.1 UMD Honeynet 
4.1.1 Introduction 


The honeypot network hosted at the University of
Maryland was initially built in 2004 with unused IP
addresses of the campus network. More recently, other
organizations joined the initiative: AT&T Labs, the
University of Illinois at Urbana-Champaign, and the
Laboratoire d’Analyse et d’Architecture des Systèmes
(LAAS) in Toulouse, France. Each of these organizations
contributes to the UMD honeynet by providing
ranges of public IP addresses.


The objective of the UMD honeynet is to provide 
the infrastructure to support honeypot-based experi- 
ments. The network features a centralized data collection 
and guarantees a realistic but controlled and flexible en- 
vironment to safely deploy experiments. The advantages 
of the present architecture are multiple: 


• A single gateway collects and stores the Snort
events, flow data, and network traffic, providing
visibility across the full range of exposed networks.


• The experiments are easy to deploy without the
need to create tunnels or to set up specific network
configurations.


• The UMD honeynet is scalable: new organizations
can join the project by providing ranges of IP
addresses.


4.1.2 Architecture 


Figure 11 shows the current architecture of the UMD
honeynet and the different institutions involved in the
project. A tunneling program called Honeymole
(http://www.honeynet.org.pt/index.php/HoneyMole) silently
redirects the traffic from the different organizations
to the UMD honeynet.
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Figure 11: UMD honeynet architecture 


The complexity of managing and monitoring such a 
network was the primary motivation for the development 
of DarkNOC. This section will discuss the application of 
the tool to that problem. 


4.2 UMD DarkNOC Implementation 
4.2.1 Subnet Status 


The subnet status section is specific to the UMD 
honeynet. Each organization involved in the UMD 
honeynet provides one or more ranges of IP addresses 
called subnets. For example, the University of Maryland 
provides two distinct subnets: a subnet of the campus 
internal network and a subnet at the border network. The 
failure of a Honeymole tunnel is a significant event for 
the network, as it implies loss of an entire subnet; the 
subnet status display allows a manager to quickly assess 
the status of the tunnels and act on any issues. 


Each subnet hosts a low interaction honeypot run by 
Nepenthes to collect malware. Depending on the net- 
work configuration, a Honeymole tunnel may be estab- 
lished to redirect the traffic to Maryland. DarkNOC mon- 
itors the quantity of malware collected, the status of the 
Honeymole tunnels, and the status of the low interaction 
honeypots. 


4.2.2 Compromised Honeypots Detection 


Some experiments deployed on the UMD honeynet may 
present significant risks. In the likely event of a honeypot 
being compromised, the attacker may use the machine to 
attack other hosts on the Internet. These attacks are gen- 
erally easily detectable: Figure 12 shows that the volume 
of outgoing traffic is substantially greater than the incom- 
ing traffic. In this case, a honeypot was used as a proxy 
server. 
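
The observation that a compromised honeypot sends much more traffic than it receives suggests a simple check. The sketch below is illustrative only: the ratio, the minimum outbound volume, and the sample layout are assumed values, not thresholds used by DarkNOC, where this detection is currently done by looking at the graph.

```python
def suspicious_outbound(samples, ratio=2.0, min_out_bytes=1_000_000):
    """Return labels whose outbound volume dominates inbound volume.

    samples -- iterable of (label, bytes_in, bytes_out) per honeypot or subnet
    """
    flagged = []
    for label, bytes_in, bytes_out in samples:
        # Ignore low-volume cases so that idle hosts do not trigger the rule.
        if bytes_out >= min_out_bytes and bytes_out > ratio * max(bytes_in, 1):
            flagged.append(label)
    return flagged
```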
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Figure 12: Network traffic (04/18/2011) 


4.2.3 Traffic Anomaly Detection 


A current experiment uses a known-vulnerable SSH
server running on about 80 IP addresses of the Internet
subnet provided by the University of Maryland. DarkNOC’s
summaries proved useful in analyzing an attack on this
configuration of the network, which occurred on
June 3, 2011.
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Figure 13: 06/03/2011, number of Flows 


1. Figure 13 shows an increase in the number of flows 
just before midnight on Thursday night. 


2. The number of attackers presented in Figure 14
remains relatively steady. This suggests that a fixed
set of attackers is generating a large volume of
traffic.


3. Figure 15 shows that port 22 is very active. As SSH
sessions do not usually generate many flows, we can
assume that the attacker is using a brute-force attack
against several IP addresses hosted within the UMD
honeynet.


4. The word cloud of the honeypots targeted showed 
that the IP addresses of this specific SSH experi- 
ment were targeted. 


DarkNOC provided several indications of the nature of
the attack responsible for the spike in network traffic and
flows. That night, the health monitoring system of the
experiment reported several times that the machine was
overloaded and that the SSH server had failed.


Figure 14: 06/03/2011, number of attackers

UDP / 19756    5550    1.33%
TCP / 3306     5033    1.2%
TCP / 3389     4513    1.08%
ICMP / 8.0     4286    1.02%

Figure 15: 06/03/2011, top 10 destination ports


4.2.4 Using Honeypots as a Security Tool


Compromised Hosts Detection
The network traffic observed within a honeypot network
is considered malicious. A healthy host would
not normally communicate with the honeypots. We can
therefore use the UMD honeynet to detect compromised
hosts on the Maryland campus network. We assume that
a campus computer that appears in the flow data is
compromised. The alerting module
queries the flow data to identify these hosts. This method
is efficient at detecting scanners: the use of subnets from
both local and remote sites means that a scanner is likely
to eventually visit the UMD honeynet whether its probes
are directed locally or at the Internet.
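
The detection rule described above (a campus host that talks to a honeypot is assumed to be compromised) can be sketched as a filter over flow records. This is an illustration under assumptions: the record layout, the function name, and the address ranges passed in are hypothetical, and the real alerting module expresses the rule as an nfdump filter.

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network


def compromised_hosts(flows, campus_cidrs, honeypot_cidrs):
    """Aggregate flows, packets, and bytes per campus host seen at the honeypots.

    flows -- iterable of (src_ip, dst_ip, packets, bytes) tuples (assumed layout)
    """
    campus = [ip_network(c) for c in campus_cidrs]
    honeypots = [ip_network(c) for c in honeypot_cidrs]

    def in_any(ip, nets):
        return any(ip in net for net in nets)

    report = defaultdict(lambda: {"flows": 0, "packets": 0, "bytes": 0})
    for src, dst, packets, nbytes in flows:
        src_ip, dst_ip = ip_address(src), ip_address(dst)
        # Keep campus sources that are not honeypots and that reach a honeypot.
        if in_any(src_ip, campus) and not in_any(src_ip, honeypots) and in_any(dst_ip, honeypots):
            entry = report[src]
            entry["flows"] += 1
            entry["packets"] += packets
            entry["bytes"] += nbytes
    return dict(report)
```

The per-host totals correspond to the flow, packet, and byte counts listed for each host in the alert report.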


When a compromised machine is detected, the alerting
module analyzes the event and generates an email that is
sent to the IT Security Officer for further analysis.
Figure 16 is an example of such a report. For each host, the
number of flows, packets, and bytes is provided. The
report is also available on the Web interface of DarkNOC,
where it is possible to visualize the flows associated with
the alert. This technique helps to identify both compromised
hosts and misconfigurations. When this alerting
system was first launched, the IT team discovered that even
when a host was tagged as blocked in their systems, the
compromised host was still able to communicate on the
network and to continue its malicious activity. The analysis
is performed every 12 hours, and each participating
organization is notified of possible compromises of
its systems. This frequency was chosen based on the
feedback provided by the security team of the University of
Maryland, which wanted to receive a report early in the
morning and right after business hours.


========== Analysis Report ==========
Flow Time Window: 2011/06/06.06:00:00-2011/06/06.18:00:01
Number of hosts detected: 3
To access the online version of the report:
https://xxx.xxx.xxx.xxx/darknoc/alert_hosts.php?report=263

xxx.xxx.xxx.xxx (X.umd.edu)
- Number of flows: 1
- Number of packets: 1
- Number of bytes: 51
To visualize the flows:
https://xxx.xxx.xxx.xxx/darknoc/alert_hosts.php?id=1124

yyy.yyy.yyy.yyy (Y.umd.edu)
- Number of flows: 10
- Number of packets: 10
- Number of bytes: 1915
To visualize the flows:
https://xxx.xxx.xxx.xxx/darknoc/alert_hosts.php?id=1125

zzz.zzz.zzz.zzz (Z.umd.edu)
- Number of flows: 10
- Number of packets: 10
- Number of bytes: 1915
To visualize the flows:
https://xxx.xxx.xxx.xxx/darknoc/alert_hosts.php?id=1126


Figure 16: Alerting module report


Security Profiling 

Honeypots can provide relevant information regarding 
attackers and their techniques to compromise a computer. 
DarkNOC brings together enough information from dif- 
ferent datasets to establish a security profile of a network. 
This profile includes the services targeted, the number of 
malware uploaded and the types of attacks. The objective 
is to help the security officers and network administrators 
to understand where to focus their efforts and to identify 
weaknesses and misconfigurations. DarkNOC can also 
be used to evaluate the performance of the security policy 
in place. The attacks detected and the malware uploaded 
on the honeypots are good indicators of the efficiency of 
an IPS device. 

Attack techniques are constantly evolving as new vul- 
nerabilities are discovered regularly. The honeypots can 
help to identify the current trends and to update the secu- 
rity policy accordingly. 


5 Related Work 


Lance Spitzner defines a honeypot as a security tool
whose value lies in being probed, attacked, or compromised
[21]. In other words, these are highly monitored
computer systems meant to attract hackers, analyze their
modus operandi, and profile them [19]. Placed in production
environments, honeypots take an active part in
the security of a network by providing information on
attackers and attack patterns. Niels Provos introduces
two types of honeypots [18]: high interaction honeypots
that involve the deployment of real operating systems on
real or virtual machines, and low interaction honeypots
that are computer software emulating operating systems
and services.

Companies and researchers currently deploy honeypot
networks at different scales. Also known as honeynets,
these honeypot networks can be limited to a few
IP addresses on the local network or be distributed systems
in several locations, such as the Leurre.com project [16],
the Internet Motion Sensor [4], SGNET [13], or the
honeynet initiative from CAIDA [23].

Levine et al. demonstrated the usefulness of deploying
honeypots across large enterprise networks [14]. In
their study, Snort [20] was used to detect compromised
computers across the Georgia Tech network. In DarkNOC,
a similar detection has been made possible by using the
flow data. We assume that any traffic seen on the honeypot
network is malicious.

The visualization and data analysis of malicious network
activity has been the focus of a variety of commercial
and open source products. On the commercial side,
security companies such as Tenable and Sourcefire offer
threat management products that collect logs from multiple
devices and generate alerts to inform security analysts
about potential intrusions. The main limitation of
these solutions with respect to our goal is that they are
not tailored to honeypot management and honeynet data
collection, so they require additional effort to integrate
honeypots into an organization’s security data analysis
suite. Arbor Networks is another commercial security
vendor that offers a threat management product; the
difference from the previous solutions is that it leverages
its customers’ networks to instrument dark IP space
at a large scale. As a result, it offers a global view
of malicious network activity through its Atlas portal
(http://atlas.arbor.net), which provides functionality
similar to DarkNOC, with graphs and tables for top attacks,
top threat sources, and attack trends.

On the open source side, the main honeynet management
solution has been Honeywall [8], developed by the
Honeynet Project. Honeywall is a bootable CD-ROM
that installs a Linux-based network gateway to manage
and control honeypots as well as to visualize and analyze
honeynet logs. Compared to DarkNOC, Honeywall
has more capabilities to actively limit outgoing traffic,
but it has been designed for small honeypot networks: it is
an all-in-one solution that provides routing, capture, and
analysis capabilities, and integrating it into an existing
large-scale honeypot network is more challenging. The
data processing capabilities of DarkNOC, in contrast, were
designed for large-scale and multi-site deployments. The
objective of the DarkNOC project is to provide a flexible
and powerful analysis program that can be adjusted to fit
different honeypot configurations.

Other open source projects that are not specifically tai- 
lored for honeypots include Alienvault [7], Aanval>, Nf- 
sight [6] and NVisionIP [12]. Alienvault and Aanval are 
network and system log management solutions that can 
only process Snort alerts and syslog events, while Nfsight
works exclusively with NetFlow and has been designed
for large-scale processing and security visualization of
NetFlow. NVisionIP processes global network NetFlow
data to specifically detect attacks and misuses.

Visoottiviseth et al. present a distributed honeypot
framework using low interaction honeypots [22] running
the honeyd daemon [17]. More specifically, they describe
how the honeyd logs are centralized and analyzed [22].
Their framework only works with honeyd log files. The
level of interaction of our framework is also different,
since we run both low interaction and high interaction
honeypots.


6 Future Work 


We are working on a number of extensions and improvements
to DarkNOC. The first extension will be the addition
of a malware section in the user interface. This new
section will provide more information about the malware
collection, including a graph showing the number of uploads
per day as well as some indications of the methods
used to upload the malicious software and its name. The
second improvement will be the implementation of the
automatic detection of compromised honeypots in the
alerting module. This detection will allow DarkNOC
to automatically block the outbound traffic of compromised
honeypots. Currently, only the detection of compromised
non-honeypot hosts of an organization is automated.
The graphical user interface of DarkNOC can
also be enhanced. There is currently no option to select
and display the activity of a specific period of the day. It
would be useful to be able to choose a particular moment
of the day on a graph and see the activity at that precise
time.


7 Conclusion 


In this paper we presented DarkNOC, a honeypot network
management and monitoring tool. DarkNOC provides
a summary of the activity of the honeypots in
the network. This summary is generated from different
sources of data including NetFlow, malware collected by
the Nepenthes low interaction honeypots, and attacks
detected by the Snort intrusion detection system. Brought
together, these data sources provide important resources
to help network administrators, security teams, and
security researchers understand attacks and protect
systems. DarkNOC can be used to detect traffic anomalies
and identify interesting case studies for research
purposes. Since it is important to quickly detect any
compromised honeypots in the honeynet, DarkNOC provides
administrators of these networks with information regarding
the health of the systems. Security teams may find
DarkNOC of particular interest since it can be used to detect
compromised honeypots as well as compromised hosts
on their non-honeypot networks. To sum up, an organization
using DarkNOC can have a better understanding
of:


• the most targeted systems,

• the attackers, the attacks, and their origin.

DarkNOC also helps:

• to obtain an overview of honeynet activity,


• to identify misconfigurations of security tools and
devices.
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Abstract 


Avatar is a new architecture devised to perform on- 
the-fly malware analysis and containment on ordi- 
nary hosts; that is, on hosts with no special setup. 
The idea behind Avatar is to inject the suspected 
malware with a specially crafted piece of software at 
the moment that it tries to download an executable. 
The special software can cooperate with a remote 
analysis engine to determine the main characteris- 
tics of the suspected malware, and choose an appro- 
priate containment strategy, which may include pro- 
cess termination, in case the process under analysis 
turns out to be malicious, or let it continue other- 
wise. Augmented with additional detection heuris- 
tics we present in the paper, Avatar can also perform 
signature-less malware detection and containment. 
Keywords: system security, malware detection 
and containment 


1 Introduction 


In the last half-decade, malware has evolved from
a “hobby” for bored programmers to a business for
cyber-criminals, who infect computer systems on a
large scale to carry out illegal activities [20]. Botnets
are a typical example of such business, and can
be exploited to collect financial/sensitive user
information. As noted by Kolbitsch et al. [13], “malicious
code, or malware, is one of the most pressing
security problems on the Internet”. Malware containment
has thus become an urgent concern. Recent
events, such as the RSA breach in March
2011 [17], have shown that serious attackers employ
ad hoc malware in multi-stage attacks to penetrate
corporate networks and get hold of business-critical
information.


Successful malware containment is based on two 
activities: Detection and analysis. 


Detection. Concerning detection, the standard
mechanisms employed against malware are based on
signatures. Antivirus software and intrusion detection
systems (both host- and network-based) rely
on some sort of byte-matching technique (either
pattern- or hash-based) to detect the presence of
malicious programs. To evade signature-based detection,
malware writers can and do obfuscate the code
using, e.g., polymorphism, packing, or encryption [4].
The result of the massive application of evasion
techniques is that in the past few years the number of
unique malware samples, and of the related signatures, has
increased dramatically. In Section 4 we discuss in
more detail some of the latest results in signature-based
malware detection.


Analysis. To understand how malware works, and
to improve the crafting of detection signatures,
researchers have developed several frameworks for
automating dynamic malware analysis (e.g., Anubis [1],
CWSandbox [7], Malheur [14]). These tools monitor
the behaviour of a malware sample while it is being
executed in a tightly controlled environment, and
produce a detailed report of the operations it carries
out (e.g., access/modifications to files, network
activities, process execution).

Dynamic malware analysis is undoubtedly effective;
however, it requires a specific analysis environment,
which cannot be just any computer. Moreover,
like almost all security techniques, it is not
infallible. First, among the possible evasion techniques, it
is becoming common practice for malware to check
whether execution takes place in a virtualized
environment, which likely indicates that the executable is
being monitored [2]. Second, as reported by Comparetti
et al. [6], some malicious behaviors, such as
the so-called “dormant functionalities”, may remain
unobserved for a long time, for instance when they depend
on circumstances that are hard to guess and to replicate
dynamically.


Summarizing, current detection and analysis ap- 
proaches suffer from the following limitations: 


• (Existing) dynamic malware analysis approaches
can only perform post-mortem, or offline, analysis
of the malware sample, once it has been collected
and submitted. Hence they lack the execution
context information; moreover, they require
specific setups.


• Detection and containment are based on signatures
or behavioral models, and are therefore effective
only for those samples for which an appropriate
signature/model has been developed.


• The most effective approaches rely on the presence
of an agent on the end-host to monitor system
activities; such an extra software component is
invasive, might affect system performance, and
causes additional burden because system administrators
must carefully plan its development and
maintenance.


In particular, security analysts do not get the chance
to analyze and contain suspicious programs on the fly.

One would like to have, in addition to standard
tools, a first line of defense against malware that does
not require special settings for the host or pre-deployed
signatures. Similarly to what happens with
intrusion detection systems, and especially for large
corporations, one could think of a security operation
center (SOC), where security analysts are able to
inspect ongoing suspicious behaviours. Thus, automatic
analysis tools could be employed to “select” suspicious
programs for analysis, which would then be
carried out with a mix of automatic and manual
inspections.


Contribution. In this paper, we present a novel approach
to perform on-the-fly malware analysis and
containment for large networks, without having to
deploy any end-host component beforehand. Our
architecture, which we call Avatar, relies on the
observation that malware distribution is usually done in at
least two phases: first the computer is infected with
a tiny “spore”, then in the following phases this spore
downloads one or more additional components from,
for instance, some previously compromised web servers.
Those components, or “eggs”, are used to extend the
malware capabilities, e.g., hooking system APIs to
grab user passwords, and usually come in the form
of executables or dynamic libraries. By doing so,
malware writers can more easily avoid detection.

Our approach is based on the injection of “goodware”
into the suspected malware: at the moment that
the alleged malware attempts to download an egg, we
substitute the egg with the goodware, which we call the
cuckoo’s egg. This is an executable that, among
other things, can carry out preliminary malware
analysis, terminate the malware, or simply give
control back to the egg if the suspected
malware turns out to be a legitimate program¹. The
current implementation of Avatar is meant to monitor
Windows-based systems.

This is done without any special setup in the host
that contains the suspected malware, which may be
just any computer running any Windows operating
system. Indeed, the cuckoo’s egg can be generated
and inoculated from the firewall, and the analysis
can be done on a remote analysis engine with which the
cuckoo’s egg communicates after it has been injected
in the host under analysis.


¹Similarly to the cuckoos that engage in brood parasitism,
our goodware is expected to circumvent the malware and take
advantage of it for performing the analysis.

²In some cases it may be illegal to inject in an application
software other than the one meant to be downloaded. Avatar
is meant to be deployed in corporate networks, where system
and network administrators are (usually) allowed to monitor,
and limit, users’ actions.

Our experiments show that this is all possible, and
that the cuckoo’s egg can, for instance, be designed
to inspect the process that executes it after the
download, or to send to Avatar’s remote analysis engine
information regarding the process, such as its path on
the file system, file handles, network/registry
activities, or even the executable itself. Depending on
the current user’s permissions, the malware analysis
engine can even “order” the cuckoo’s egg code to suspend
or terminate the process, effectively containing a
possibly larger infection.

An important side-issue is when one should start
being suspicious about a given process. In other
words, when should the system suspect that a spore
is actually trying to download an egg? For our
experiments we have developed a heuristic method which
works as follows: Malware is usually programmed to
use several different download servers, as servers are
often offline or discontinued. In practice, the spore
often fails a number of times before succeeding in
downloading the egg. Thus, we take into consideration
per-host failed TCP connections and failed HTTP
requests to identify malware download attempts.
A number of failed HTTP requests in a short time is a
good indication of the presence of malware. Our
experiments (Section 3) show that this method is
effective on the samples we tested. However, one can
devise other heuristics which may be applicable in
other contexts. It is outside the scope of this paper
to make an inventory of such methods.

To the best of our knowledge, this is the first
approach which, without the installation of any
additional plug-in beforehand, allows one to:


• (analysis) carry out on-the-fly remote analysis
of a suspicious program;


• (containment) suspend or terminate the suspicious
program directly on the infected host;


• (detection) in combination with the heuristics
for detecting suspicious downloads, identify
suspicious malware processes which can be
immediately analyzed and contained if required.


We should remark that this is done without using
signatures of any kind. Therefore, this approach can
also be used to detect, analyze and contain zero-day
malware and malware for which there is no signature
available yet. For example, one could even think of a
“paranoid” mode, in which a cuckoo’s egg is shipped
for each download of executables regardless of the rate
of failed connections.

We show that Avatar is effective as a lightweight
first line of defense against malware, and that it also
allows malware containment on hosts with no specific
pre-deployed tools (agent-less). This is a crucial
requirement for system administrators of large
networks, as it removes the burden of installing
additional software to perform accurate monitoring.

It is important to stress that this approach can be
adapted to work with any protocol; in our embodiment
we choose HTTP because it is widely used by malware
writers. Of course, this approach has limitations, and
can be countered to some extent. These aspects are
discussed in Section 2.6.


2 Architecture 


The architecture of Avatar consists of three main
parts. The download detection engine (DDE) is
responsible for detecting suspicious attempts to
download software components. The Cuckoo’s Egg
Generator (CEG) is responsible for crafting the special
analysis software that will be sent to the requesting
host. Finally, the Malware Analysis Engine (MAE) is
responsible for analysing the information provided by
the injected cuckoo’s egg and possibly initiating some
containment strategies. We now provide a detailed
description of each component.


2.1 Download Detection Engine 


The download detection engine (DDE) detects
(failed) download attempts that might be due to
malware activity. Strictly speaking, the functioning of
the DDE is orthogonal to that of the analysis and
containment engines of Avatar, which, on the other
hand, are the core of the system. In fact, Avatar
would work just as well in combination with
any other method one could devise to spot
suspicious download attempts. Nevertheless, it is easier
to explain the whole architecture by starting from the
DDE.

Figure 1: The Avatar architecture.

Our DDE is based on the fact that malware often
fails a number of times to download eggs. This is due
to the fact that download servers are often offline
and/or taken down by security officers. In our
embodiment the detection engine uses a modified
version of the Threshold Random Walk (TRW)
algorithm [10]. The engine builds a per-host model of
normal usage, which takes into account the number
of failed connections and failed HTTP requests. In
the case of malware, the former can occur when, e.g.,
the remote web server has been deactivated, the latter
when the malicious content has been removed. As
confirmed by our tests (see Section 3.1), these are not
infrequent events. The resulting algorithm is simple
yet effective, and could easily be extended to include
additional sources of information (e.g., DNS queries).

The DDE may be located at the network “border”
with the Internet, in order to observe any outgoing
connection and the data sent back by the remote host.
As we said, while this component plays an important
role in our approach, it is not the main driver
of our idea. For instance, one could decide to inspect
any executable downloaded from the Internet, without
the host having to fail a number of connections or
HTTP requests before being flagged as suspicious.

The TRW algorithm is devised to detect scanning
behaviours originating from a specific host in a
monitored network. For each host a detection model is
built. The outcome of a connection attempt is either
“success” or “failure”. After a number of observations
of connection attempts for a certain host h, one
would like to know if h is a scanner. To make such a
decision, a sequential hypothesis testing method is
used. The basic premise is that a benign host has a
characteristic ratio of failed to successful
connections, and that this ratio is different when a
host is a scanner. Furthermore, as observations
accumulate, the test statistic for each individual host
eventually crosses an upper or a lower threshold,
depending on whether the host is a scanner or not.

We have adapted the TRW algorithm to also take into
account successful and failed HTTP requests. We
currently employ only one model for both TCP
connections and HTTP requests.
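
The sequential hypothesis test at the heart of TRW can be sketched as follows. This is a minimal illustration of the standard likelihood-ratio update, not the modified version used in Avatar: the probabilities `theta0` and `theta1`, the target detection and false-positive rates, and all names are assumptions chosen for the example. A failed HTTP request is treated as just another failure outcome, in line with the single shared model described above.

```python
from dataclasses import dataclass


@dataclass
class TrwModel:
    """Per-host sequential hypothesis test (illustrative parameters only)."""
    theta0: float = 0.8     # assumed P(success | benign host)
    theta1: float = 0.2     # assumed P(success | suspicious host)
    p_detect: float = 0.99  # desired detection probability (assumption)
    p_false: float = 0.01   # tolerated false-positive probability (assumption)
    ratio: float = 1.0      # running likelihood ratio for this host

    def observe(self, success: bool) -> str:
        """Update with one TCP/HTTP outcome and return the current verdict."""
        if success:
            self.ratio *= self.theta1 / self.theta0
        else:
            self.ratio *= (1 - self.theta1) / (1 - self.theta0)
        upper = self.p_detect / self.p_false              # flag as suspicious
        lower = (1 - self.p_detect) / (1 - self.p_false)  # accept as benign
        if self.ratio >= upper:
            return "suspicious"
        if self.ratio <= lower:
            return "benign"
        return "pending"
```

With these example values, each failure multiplies the ratio by roughly 4, so a host is flagged on its fourth consecutive failure; real deployments would tune the parameters to the monitored network.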


2.2 Cuckoo’s Egg Generator (CEG) 


Once the DDE identifies a suspicious download
attempt, the CEG generates a specific executable/DLL
to be fed back to the suspicious host. We now
describe in detail the purpose of the cuckoo’s egg and
its “internals”.


2.2.1 The cuckoo’s egg 


The main goals of the cuckoo’s egg are I) to gain
as much knowledge as possible about the executing
process, which, in the case of malware, is usually the
process that tried to download the “egg” and received
the cuckoo’s egg instead, and II) to take control
over the parent process if necessary. The cuckoo’s
egg operates in two stages.

First, the cuckoo’s egg “inspects” the execution
environment. The reason for this is that different
operating systems allow processes to execute certain
operations with or without high privileges. Therefore,
the cuckoo’s egg may be allowed to perform only a
restricted set of operations. For instance, beginning
with Windows Vista, Microsoft includes a User Account
Control (UAC) mechanism. The system can be
set to notify the user when a process is about to
modify some important system settings or execute
potentially dangerous operations, so that the user can
give explicit authorization. Because we want the
cuckoo’s egg to be as transparent as possible (for
usability reasons), on Vista and later OSes we cannot
use a number of features, such as debugging mode, as
these could (possibly) trigger the UAC.

The cuckoo’s egg attempts to inject a specifically
crafted DLL into its parent process, trying different
access rights in turn, because the parent process can
restrict the set of operations the child process is
allowed to perform. The combinations of access right
masks the cuckoo’s egg uses are: PROCESS_ALL_ACCESS
(highest privileges), TERMINATE_PROCESS |
QUERY_INFO | READ, QUERY_INFO | READ and
TERMINATE_PROCESS (lowest privileges).

Secondly, the injected DLL extracts, if allowed to,
some information from the parent process (depending
on the operational mode, see Section 2.4). This
information includes: full path, executable size on
disk, DLLs that have been loaded, and information
related to the current window attached to the process
(if any), such as handle, size and caption text.
At this stage, the cuckoo’s egg’s DLL attempts to
determine quickly whether the parent process is
malicious or not, initially employing some heuristics
based on the data above. Our experiments show
that in most cases, one can tell straight after these
heuristic checks whether the parent process is likely
to be malware. For instance, a large executable size
(more than 5 MB) is a sign of a non-malicious process:
Malware writers tend to reduce the size of the
“spore” to bypass anti-malware countermeasures more
easily. Similarly to [13], we also whitelist
applications that could perform a licit download and
later execute the downloaded file (e.g., Internet
Explorer, Windows Update). Some limitations apply
to these heuristics, and we discuss them in detail in
Section 2.6. An additional heuristic one might
apply is the approach presented in [18], based on
PE header analysis of suspicious programs.
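
The two heuristic checks named above (executable size and whitelist) can be sketched as a small triage function. This is only an illustration under stated assumptions: the `ParentInfo` record, the `triage` function, its return labels, and matching the whitelist by file name are all hypothetical, and the real cuckoo’s egg may use further checks that are not described here.

```python
from dataclasses import dataclass, field

LARGE_EXECUTABLE_BYTES = 5 * 1024 * 1024  # "more than 5 MB", as in the text


@dataclass
class ParentInfo:
    """Data the injected DLL collects about the parent process."""
    full_path: str
    size_on_disk: int
    loaded_dlls: list = field(default_factory=list)
    window_caption: str = ""


def triage(info: ParentInfo, whitelist: set) -> str:
    """Return 'legitimate' or 'send_to_mae' (hypothetical labels)."""
    # Assumption: whitelist entries are lower-case executable file names.
    name = info.full_path.replace("\\", "/").rsplit("/", 1)[-1].lower()
    if name in whitelist:
        return "legitimate"
    # A large executable is taken as a sign of a non-malicious process.
    if info.size_on_disk > LARGE_EXECUTABLE_BYTES:
        return "legitimate"
    # Otherwise the heuristics cannot clear the process.
    return "send_to_mae"
```

Matching by file name alone is easy to spoof; it is used here only to keep the sketch short.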

If the heuristics do not indicate that the process
is legitimate, then the information is passed to the
MAE (discussed below) for remote analysis. Then,
depending on the operational mode set and the user
access rights, the cuckoo’s egg can I) debug the
parent process, II) let it run normally, III) “freeze” it,
and, as a very last countermeasure, IV) terminate it.
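
How the granted access rights constrain the available actions can be sketched as follows. The constants are the standard Windows process access rights; mapping the paper’s QUERY_INFO and READ to `PROCESS_QUERY_INFORMATION` and `PROCESS_VM_READ`, the `open_process` callable, and the rough mask-to-action mapping are assumptions made for illustration, not the authors’ implementation.

```python
# Windows process access-right constants (values from winnt.h).
PROCESS_TERMINATE = 0x0001
PROCESS_VM_READ = 0x0010
PROCESS_QUERY_INFORMATION = 0x0400
PROCESS_ALL_ACCESS = 0x1F0FFF  # pre-Vista value; newer SDKs define 0x1FFFFF

# Masks ordered from highest to lowest privilege, as listed in the text.
MASKS = [
    PROCESS_ALL_ACCESS,
    PROCESS_TERMINATE | PROCESS_QUERY_INFORMATION | PROCESS_VM_READ,
    PROCESS_QUERY_INFORMATION | PROCESS_VM_READ,
    PROCESS_TERMINATE,
]


def best_access(open_process, pid):
    """Try each mask in turn; return (mask, handle) or (None, None)."""
    for mask in MASKS:
        handle = open_process(mask, pid)
        if handle:
            return mask, handle
    return None, None


def allowed_actions(mask):
    """Rough, assumed mapping from the granted mask to actions I) to IV)."""
    actions = {"run"}  # letting the parent run needs no extra rights
    if mask is None:
        return actions
    if mask == PROCESS_ALL_ACCESS:
        actions |= {"debug", "freeze"}
    if mask & PROCESS_TERMINATE:
        actions.add("terminate")
    return actions
```

On Windows, `open_process` could wrap `ctypes.windll.kernel32.OpenProcess(mask, False, pid)`; passing it in as a parameter keeps the sketch testable on any platform.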

In the first case, the cuckoo’s egg can send the
debugged instructions back to the highly-instrumented
malware analysis engine. By doing so, we can “replay”
any operation on the remote analysis engine and
determine whether we are debugging a malware process.
However, our experiments show that this approach
collects very little useful information on the parent
process, as the malicious process usually executes the
egg(s) as the very last step of its run.

In the second and third cases, the cuckoo’s egg
sends the parent executable back to Avatar, which
is also the reason why we need to collect the parent’s
full path. By sending the whole executable, we can
restart the process execution from scratch within our
monitored environment and off-load a more accurate
analysis.


2.2.2 Packaging the cuckoo’s egg 


Once the DDE notifies the CEG, the latter has to
generate a suitable cuckoo’s egg for the target and,
depending on the operational mode, “attach” the
original executable to the cuckoo’s egg. We distinguish
two cases, depending on whether or not the original
executable is available, i.e., whether it could be
downloaded.

As we mentioned earlier, the original executable
requested by the host may not be available. In this
case, only the cuckoo’s egg is sent back to the
target without any further processing. If, on the other
hand, the originally requested executable is available,
and the operational mode allows it, the CEG
“forces” the execution of the cuckoo’s egg first, and
then that of the “real” executable. Hence, the main
concern when shipping the cuckoo’s egg is to preserve
the egg’s functionality as much as possible. There
are several ways to achieve this, two of which are
discussed here, each preserving the functionality in a
different way:


• injecting a DLL loader stub through Portable
Executable injection;


• shipping a replacement executable that fetches
and executes the egg after the parent process
has been analyzed.


In the first case, the Portable Executable (PE) file
header of the downloaded egg is altered. The PE format
[15] is a file format for executables, object code,
and DLLs, used by Windows since early NT versions.
When an executable is launched, the system process
loader uses the information included in the PE header
to carry out operations such as filling in-memory
data structures, loading required DLLs, and jumping
to the entry point of the executable. In this case,
the CEG appends the cuckoo’s egg to the egg’s
executable file; next, the egg’s entry point is modified
to point to a loader stub that unpacks the engine
and writes it to a file, after which it is loaded
like any regular DLL. This method is rather complex.
It has the disadvantage that it might trigger
the antivirus (unless some packing techniques are
used), and it requires the LoadLibrary and
GetProcAddress offsets to be available in the egg’s PE
header, which is usually the case.
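
To make the entry-point redirection concrete, the sketch below locates and rewrites the `AddressOfEntryPoint` field of a PE image using the documented header layout. It covers only this single field: appending the stub, adjusting section headers and image size, and everything else a working injector needs are left out, and the function names are hypothetical.

```python
import struct


def entry_point_offset(image: bytes) -> int:
    """Return the file offset of the AddressOfEntryPoint field."""
    if image[:2] != b"MZ":
        raise ValueError("missing DOS header")
    # e_lfanew (offset 0x3C) holds the file offset of the PE signature.
    (e_lfanew,) = struct.unpack_from("<I", image, 0x3C)
    if image[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # 4-byte signature + 20-byte COFF header + 16 bytes into optional header.
    return e_lfanew + 4 + 20 + 16


def read_entry_point(image: bytes) -> int:
    """Read the entry-point RVA that a loader stub would replace."""
    (rva,) = struct.unpack_from("<I", image, entry_point_offset(image))
    return rva


def patch_entry_point(image: bytes, new_rva: int) -> bytes:
    """Return a copy of the image whose entry point is new_rva."""
    patched = bytearray(image)
    struct.pack_into("<I", patched, entry_point_offset(image), new_rva)
    return bytes(patched)
```

A stub installed this way would typically jump back to the original entry point once the cuckoo’s egg has been loaded, which is why the original value is read before patching.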

The second method is much simpler, requires no
modifications to the egg’s executable file, is usually
not flagged as malicious by the antivirus and does
not make any assumptions on the egg’s PE header.
A stand-alone cuckoo’s egg is sent back and, once
the analysis is over, it downloads and executes the
“original” egg. The downside to this approach is
that any relation that the egg may want to set up
with its parent is lost. Moreover, this could
significantly slow down the execution, by introducing an
additional download latency.


2.3 Malware Analysis Engine 


The MAE is the core component of the Avatar
architecture. It is responsible for analysing the
information sent by the cuckoo’s egg. If necessary, it
runs the suspected executable in a protected
environment. From a functional point of view, it does not
differ from other malware analysis tools. Once the
sample to analyse is received, it is executed and any
operation performed is recorded and logged. The
execution report can then be dispatched to a security
analyst, who can give a final verdict on the
maliciousness of the sample in case the executed
program’s nature remains unclear.

The MAE is also used to store information about 
whitelisted programs, which the cuckoo’s egg will 
consider as non-suspicious. By doing so, we can basi- 
cally centralize our architecture, making it possible to 
“update” crucial information about malware in one 
step. 


2.4 Operational modes 


As networks and hosts require different confidentiality
and availability levels, users need to control the
way the cuckoo’s egg can affect the execution of
processes. As with all detection and prevention
systems, false positives are always possible, so
one has to find an appropriate compromise between
rigorous containment, at the risk of terminating a
legitimate process, and less drastic measures. In our
embodiment, we have implemented three basic
operational modes.


Transparent mode When in this mode, the DDE 
notifies the CEG about the failed attempts to pull 
down some files from an external server. The CEG 
then waits for the file to be actually downloaded, and 
verifies it is an executable. If so, the CEG crafts a 
cuckoo’s egg with the original file appended. Once 
the execution of the cuckoo’s egg is over, the origi- 
nal file is automatically executed. The cuckoo’s egg 
sends back to the MAE a copy of the parent execut- 
ing it for analysis. No further action is possible on 
the suspicious host, as the cuckoo’s egg releases the 
parent process’ executable. This mode does not in- 
terfere with regular operations of the suspicious host, 
as the original requested file is executed. 


Semi-transparent mode This mode differs from
the transparent mode as follows. The original file is
downloaded and attached to the cuckoo’s egg. However,
when the cuckoo’s egg is executed, it freezes
the parent process. Then, the cuckoo’s egg runs the
heuristic checks and might decide to “release” its
parent process immediately. If the heuristic checks
cannot clearly determine the nature of the parent
process, the cuckoo’s egg ships a copy of the
parent process’ executable to the MAE. Then, it waits
for further commands from the MAE. Further
commands may include the termination or release of the
process. This mode might interfere with the regular
operations of the suspicious host, as the parent
process is frozen while the analysis is in progress.


Non-transparent mode When in this mode, the
CEG is notified about the failed downloads but,
provided the requested filename points to an executable,
does not wait for the original file to be successfully
downloaded. Instead, it immediately ships a cuckoo’s
egg. Based on the heuristic checks, the cuckoo’s egg
might send a copy of the parent executable back to the
MAE, and then waits for further commands. This mode
heavily interferes with the regular host operations, as
the requested file is not executed.
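
The differences between the three operational modes can be summarised in a small lookup. This is a sketch of the descriptions above rather than of the actual implementation; the `Mode` enum, the `plan` function and the key names are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    TRANSPARENT = "transparent"
    SEMI_TRANSPARENT = "semi-transparent"
    NON_TRANSPARENT = "non-transparent"


def plan(mode: Mode) -> dict:
    """Summarise what the CEG and the cuckoo's egg do in each mode."""
    return {
        # CEG waits until the original file has actually been downloaded.
        "wait_for_original": mode is not Mode.NON_TRANSPARENT,
        # Original file is attached to the egg and executed afterwards.
        "attach_original": mode is not Mode.NON_TRANSPARENT,
        # Freezing the parent is only stated for semi-transparent mode.
        "freeze_parent": mode is Mode.SEMI_TRANSPARENT,
        # Egg waits for MAE commands such as terminate or release.
        "await_mae_commands": mode is not Mode.TRANSPARENT,
    }
```

For example, `plan(Mode.NON_TRANSPARENT)` reports that the CEG neither waits for nor attaches the original file, which matches the statement that the requested file is not executed in that mode.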


2.5 Implementation 


To carry out our experiments we have implemented 
a proof of concept version of Avatar. The three main 
components of Avatar can be placed at different lo- 
cations on the network. However, in our experiments 
we have coupled the DDE and the CEG together into 
a single host. The reason for this is that the DDE 
and CEG must exchange information about failed 
downloads, and the CEG must craft and supply the 
cuckoo’s egg in a timely manner. The deployment 
of these two components on physically separated sys- 
tems might introduce delays that could impact the 
analysis. 

In practice, to allow a transparent deployment that
does not require any reconfiguration on the host side,
we employed a single Linux box with a built-in firewall
and web proxy. The firewall transparently redirects
the outgoing traffic directed to common HTTP ports
(TCP ports 80 and 8080) through the web proxy,
which can inspect both request and reply. Thus, no
reconfiguration of client hosts is required. As
firewall, we use Netfilter, the packet filtering
framework of the Linux kernel. Netfilter offers the
possibility to insert specific “hooks” in its packet
processing workflow, so that it is possible to
inspect, and even modify, on the fly any packet passing
by. To inspect HTTP traffic, we set up a web
proxy based on Apache. Apache supports modules
for adding new functionality, and we have developed
a new module to inspect requests and their content.
Internally, the module maintains a table that contains
statistics about internal hosts and their
connection/request failure rates. The module also
inspects the replies sent back by the remote (web)
server.

When the same host performs several failed
connections in a given timeframe, or requests to pull
down some file(s) do not end successfully, the Apache
module marks that host as suspicious. Depending on
the operational mode, the module will either wait
until a request is successful, and then ship back a
crafted cuckoo’s egg together with the original file,
or it will immediately ship back a cuckoo’s egg
(provided the request points to an executable
filename).

If the requested file is eventually downloaded, the
module proceeds with some sanity checks and verifies
that the downloaded file is actually an executable.
In case of a positive match, the executable is stored
and the cuckoo’s egg crafted. To craft the cuckoo’s
egg and append the requested file, we implement the
first method presented in Section 2.2.2 (PE header
injection), to avoid download latency.
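
A sanity check of this kind can be as simple as testing the file’s magic numbers. The sketch below is an assumption about what such a check might look like, not the code of the Apache module; `looks_like_pe` is a hypothetical name.

```python
import struct


def looks_like_pe(data: bytes) -> bool:
    """Check the DOS magic and the PE signature it points to."""
    # A valid DOS header is at least 0x40 bytes and starts with "MZ".
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # e_lfanew (offset 0x3C) holds the file offset of the PE signature.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    return data[e_lfanew:e_lfanew + 4] == b"PE\x00\x00"
```

A check based only on these bytes accepts anything that carries the right header and rejects an executable wrapped in another container, which is the weakness discussed in Section 2.6.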

The analysis engine is implemented as a Windows
kernel driver. In order to monitor malware activities,
the driver hooks some API functions, and exploits
the capabilities offered by the latest Windows OSes,
which provide built-in sub-systems for third-party
antivirus and firewall software. These interfaces allow
one to detect changes in the file system and the system
registry, monitor network connections, etc.

Technically speaking, the MAE resides on a real
system behind a firewall, in order to prevent any
outgoing connection that could be initiated by the
malware once it is activated. The MAE does not run in
a virtualized environment, to avoid triggering possible
built-in anti-analysis capabilities of the malware. This
choice has the disadvantage of requiring a roll back to
the original status after each analysis. We do not see
this as a serious limitation because our current goal
is not to speed up malware analysis, which would
require several concurrent systems. Nevertheless, the
kernel driver can be deployed in a virtualized
environment too.

The cuckoo’s egg communicates with the analysis
engine through encrypted network sockets. Encryption
is used to avoid leaking any sensitive
information, e.g., a memory dump, over the network,
and to prevent the spore from tapping our
communications.


2.6 Limitations and evasion of Avatar 


In this section we discuss the limitations of our ap- 
proach. 


Limitations of the CEG When crafting the
cuckoo’s egg, the original requested file can be
attached to it. This process could break self-extracting
archives, which verify the file integrity before
inflating the content.


Evading the DDE Our approach works by first
detecting (failed) attempts to download additional
components. If malware evades this detection phase,
then Avatar cannot ship the cuckoo’s egg. To avoid
detection, malware could initiate connections at a
very low rate, as part of our detection relies on a high
rate of failed connections. Encrypting connections
could also be a countermeasure against inspection.


Evading the CEG Another possible way of evading
Avatar is by using some sort of verification
mechanism for the downloaded components. Encryption
and hashing could be employed to detect a mismatch
with the expected file, for instance by compressing
the executable and protecting the archive with
a password. Moreover, because the sanity check
performed on the downloaded file can be based solely on
magic numbers, a malware writer could hide the
executable within a different file type and change the
file header at run-time, once downloaded.


Evading the cuckoo’s egg Because the cuckoo’s
egg employs heuristics to decide whether to continue
the analysis or to send the parent executable back to
the instrumented host for analysis, malware could
take some countermeasures to evade the heuristic
checks. For instance, since Windows 2000, a process
can execute instructions within the context of
another process by using the CreateRemoteThread API
function (a similar function allows the injection of
DLLs). Thus, malware could inject arbitrary malicious
instructions in the context of an accessible
whitelisted process, e.g., Internet Explorer, which is
usually executed with the same access rights the
malware has, to evade some checks performed by the
cuckoo’s egg³.


³It is worth noting that the very same technique could be
used to evade approaches like the one presented in [13], which
relies on the fact that some processes can be whitelisted
beforehand to avoid false alerts.


Possible Solutions Although we acknowledge
that it is possible to devise malware with
anti-analysis features tailored for our approach, we did
not observe any during our experiments. Moreover,
the use of encryption or hashing for file verification
would likely slow down the malware spread,
as either “updated” versions would fail the check or
researchers could reverse engineer some malware
samples and identify the encryption key/password of the
mechanism.

We also think that malware writers might
be reluctant to add a verification step to the
malware, as it might simplify the work of signature-based
detection systems. Once the malware
is analyzed, the key used for encryption would very
likely be identified, and this could be used to craft
an effective signature for detecting it.

A possible solution to the evasion of the cuckoo’s
egg would be to add a comparison of the executable
on disk with the memory image, and
pinpoint instructions possibly added later. However,
this would also require inspecting DLLs, and the task
could easily become infeasible (let alone not being
bullet-proof). We plan to address this issue in future
work.


3 Benchmarks 


To validate the effectiveness of our approach, we use
two different datasets. The first dataset, referred to
as DS_A, is available on request from the team that
built Malheur. It contains a large collection of
malware samples that could be used for malicious
purposes. In practice, the dataset is a collection of
samples submitted over a period of eight consecutive days
in 2009. Each sample has been analyzed by CWSandbox,
and the related report is included together with
the original sample. This dataset is used to test the
basic idea of our approach, namely that malware will
execute an arbitrarily generated “egg”.

Our second dataset, DS_B, is a collection of
malware samples found in the wild. For some samples,
no report was available beforehand (meaning they
were brand new or modifications of known malware
samples). Hence, we had to submit the sample to
either Anubis or CWSandbox to learn whether the
sample was actually malware and downloaded some
extra components. With this dataset we want to test
in particular the effectiveness of the devised heuristics
for triggering instrumented analysis of the suspicious
process.


3.1 Tests with DS_A


This dataset is an extensive collection of malware
samples. They belong to different malware families
and are all unique, meaning that some sort of
polymorphism/code reordering has been applied.

However, not every sample downloads extra
components, and among those which perform download
activities, a large part cannot work properly these
days. This is due to the fact that, before downloading
the extra executables, the malware sample attempts
to download some configuration files, which are no
longer available. We select only working samples that
download additional components, with a maximum of 10
samples per family (75 samples in total).

To perform the experiment, we set up a client host
running Windows XP SP3, as some malware samples
crash when executed under more recent OSes⁴, like
Windows 7. No extra user activity is simulated. For the
DDE, we use the following settings: 5 failed
connection/download attempts in 1 minute indicate a
possibly malicious program. The operational mode for
this dataset is set to transparent mode. Table 1
summarizes our findings.
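
The DDE setting used here (5 failures within 1 minute) can be illustrated with a plain sliding-window counter. Note that this is a simplification for illustration only: the actual DDE is based on a modified TRW algorithm, and the `FailureWindow` class and its parameters are hypothetical.

```python
from collections import defaultdict, deque


class FailureWindow:
    """Flag a host after `threshold` failures within `window` seconds."""

    def __init__(self, threshold: int = 5, window: float = 60.0):
        self.threshold = threshold
        self.window = window
        self._failures = defaultdict(deque)  # host -> failure timestamps

    def record_failure(self, host: str, now: float) -> bool:
        """Record one failed connection/download; True if host is suspicious."""
        times = self._failures[host]
        times.append(now)
        # Drop failures that have fallen out of the time window.
        while times and now - times[0] > self.window:
            times.popleft()
        return len(times) >= self.threshold
```

With the default settings, five failures spaced 10 seconds apart flag the host on the fifth failure, while failures spaced 30 seconds apart never accumulate enough entries inside the window.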

Discussion Tests on DS_A show the effectiveness
of our approach. However, we have observed that
for a few samples, and for a certain malware family
in particular, the cuckoo’s egg is not actually
executed. There are two distinct reasons for this. In the
case of random samples, once the cuckoo’s egg
injects its crafted DLL the parent process crashes. In
the case of the “Killav” malware family, the
malware sample relies on the user to actually execute
the downloaded file(s). In all the other cases, the
malware runs no check on whether the downloaded file
is actually a “legitimate” malicious component. This
reinforces our assumption that malware writers do not
currently protect their programs with
encryption/hashing mechanisms.


⁴We investigated this issue and found some incompatibilities
between installed and expected system libraries.


Malware family | # of samples | # of samples marked as malicious by the DDE | # of samples that executed the cuckoo’s egg


Table 1: Actual samples used in our tests with dataset DS_A, samples flagged as malicious by the DDE
and samples that executed the shipped cuckoo’s egg. The * marks a family of malware that actually downloads the
cuckoo’s egg, but does not automatically execute it (and leaves this to the user). In most cases, the DDE
detects failed download attempts, and the cuckoo’s egg is executed right away by the malicious sample,
without any integrity check.

For the “NothingFound” family, whose name might
refer to the fact that the submitted sample has not
been identified as malicious by CWSandbox, we have
to report that the cuckoo’s egg has actually been
executed most times.


3.2 Tests with DS_B


This dataset is used to test how our approach
performs with (supposedly) brand new malware. Samples
were collected in March 2011, and most of them would
not have been detected by several antivirus products at
the time of collection (we processed each sample
through the VirusTotal [23] web site). We have a total
of 30 malware samples from this dataset that download
extra malware components. For this set of tests, we
also simulate regular user activities such as browsing
and downloading, with 30 different programs, ranging
from web browsers to crawlers. Because the downloading
program might not execute the cuckoo’s egg, we automate
its execution and set the parent process to be the
downloading program. For the DDE, we use the following
stricter settings: 3 failed connection/download
attempts in 1 minute indicate a possibly malicious
program.


To perform the experiment, similarly to the tests
with DS_A, we set up a client host running Windows
XP SP3. The operational mode for this dataset is
set to semi-transparent mode. By doing so, we test
at the same time how effective the heuristics are in
detecting malware programs. Because some goodware
programs that the heuristics might send to the MAE for
analysis could rely on the presence of certain system
libraries, for this experiment the MAE is running on
a mirror copy of the attacked system. When samples
are sent to the MAE, we set a maximum waiting time of
3 minutes without any operation being performed: By
doing so we avoid false positives in case of goodware,
but might introduce false negatives in case of
malware. Table 2 summarizes our findings.


Discussion This second round of tests confirms
that even the latest malware code is still “vulnerable”
to the injection of our cuckoo’s egg. Most samples
have been correctly identified by the DDE, and
only 2 samples have been missed. These samples
stopped their download attempts just after a
few tries. The DDE also mistakenly flags some regular
programs as malware. This was expected behaviour, as
we set strict values for the DDE. Only one program did
not execute the shipped cuckoo’s egg, due to a crash at
the moment of injection. We experienced the same
problem for several samples from DS_A, and our
investigations show that the malware was not fully
compatible with the installed set of libraries, and
therefore would have crashed anyway.


Sample type | Result | # of samples
Malware | Correctly identified by the DDE | 28/30
Malware | That executed the cuckoo’s egg | 27/30 (27/28)
Malware | Correctly identified as malware by heuristics | 13/30 (13/27)
Malware | Erroneously identified as goodware by heuristics | 2/30 (2/27)
Malware | Sent to the MAE for analysis | 12/30 (12/27)
Goodware | Erroneously identified by the DDE | 10/30
Goodware | Correctly identified as goodware by heuristics | 6/30 (6/10)
Goodware | Erroneously identified as malware by heuristics | 2/30 (2/10)
Goodware | Sent to the MAE for analysis | 2/30 (2/10)


Table 2: Results for tests with dataset DS_B (in the third column we report partial results in brackets).
Almost every malicious download attempt has been detected by the DDE, which shipped the cuckoo’s egg.
The heuristics identified malware samples in almost half of the cases, and mistakenly flagged malicious
samples as goodware only in a couple of cases. The false positive rate for the DDE is around 30%, and around 20% for
the heuristics (when considering the cases in which the cuckoo’s egg was shipped).

The heuristics prove to be an effective way to reduce
the workload of the analysis engine (and thus
the manual workload for security analysts), when
working in combination with the DDE. Fewer than
50% of the analysed malware samples (12 of 27) were
sent to the MAE for analysis; the heuristics classified
the remaining samples (more than 50%) immediately,
13 of them correctly as malware. In the case of
non-malware samples, 20% of the analysed samples
(2 of 10) had to be sent to the MAE for further
analysis, while 60% (6 of 10) were identified as
non-malware without further analysis. For each
successful execution, the cuckoo’s egg would have been
allowed to terminate the parent process, effectively
containing a possible infection.
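
The percentages quoted in the discussion can be recomputed from the counts in Table 2. The short sketch below does only that arithmetic; the dictionary labels are our own shorthand for the table rows, not terms used by the authors.

```python
# Counts copied from Table 2 as (numerator, denominator).
table2 = {
    "malware detected by DDE": (28, 30),
    "malware sent to MAE": (12, 27),
    "malware flagged by heuristics": (13, 27),
    "malware missed by heuristics": (2, 27),
    "goodware flagged by DDE": (10, 30),
    "goodware sent to MAE": (2, 10),
    "goodware cleared by heuristics": (6, 10),
    "goodware flagged by heuristics": (2, 10),
}


def percent(numerator, denominator):
    """Return the ratio as a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


rates = {label: percent(n, d) for label, (n, d) in table2.items()}
for label, rate in rates.items():
    print(f"{label:32s} {rate:5.1f}%")
```

The partial denominators (27 and 10) count only the samples for which the cuckoo’s egg was shipped and executed, which is why they differ from the 30 samples per class.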


3.3 System overhead 


Because the Avatar architecture performs on-the-fly
analysis and eventual detection, program execution
is slowed down by a certain overhead. Depending on
the operational mode, this overhead can range between
milliseconds and minutes. A typical example of the
former case is transparent mode, when the heuristics
flag the downloading program as actual goodware. In
non-transparent mode, if the heuristics fail to
determine the actual nature of the analysed program,
the MAE is activated for further inspection (thus
increasing the overall analysis time). All our
benchmarks were performed on a 1.8 GHz dual-core
machine (target) and a 3.0 GHz quad-core machine
(MAE). Table 3 reports overhead values for the
different scenarios we observed during our benchmarks
(overhead refers to the execution time of the
cuckoo’s egg and, possibly, of the MAE).


| Dataset                 | Min overhead | Avg overhead | Max overhead |
| DS_A                    | [illegible]  | [illegible]  | [illegible]  |
| DS_B (malware samples)  | 19 ms        | [illegible]  | [illegible]  |
| DS_B (goodware samples) | 16 ms        | 9 s          | [illegible]  |

Table 3: Overhead time values for DS_A and DS_B. When heuristics successfully identify the analysed sample,
the overhead can be as low as 16 ms. The maximum overhead value depends on the MAE analysis.


4 Related work


In this section we discuss related work. As malware
has become a serious security threat, a good deal of
work exists that discusses techniques to analyse and
detect malicious code.


4.1 Malware Analysis


Sidiroglou and Keromytis [19] present an architecture
to detect and capture potential malware infection
vectors by using a collection of heterogeneous
detection engines. Engines range from host-based sensors
monitoring the behaviour of applications and OSes to
honeypots that simulate possible target applications.
Each time a potential malware vector, e.g., a byte
stream, is detected, it is copied and forwarded to a
sandboxed environment, which runs some instances
of the applications one wants to protect (e.g., the
Apache web server) and a number of tools to verify
the potential maliciousness of the input. The authors
provide several strategies for fixing, among others,
buffer overflow vulnerabilities “on-the-fly”. Although
the authors do not provide any implementation of
their architecture, there are several similarities
with our approach. Once the cuckoo’s egg is executed,
the suspicious program is copied and forwarded to a
sandboxed environment for dynamic analysis. The main
difference lies in the way we inspect the suspicious
program, by crafting the cuckoo’s egg and sending it
together with the originally requested file.

Anubis [3] and CWSandbox [24] are two prominent
architectures for dynamic malware analysis. In
particular, Anubis can aggregate malware samples that
present similar behaviour into “clusters”. That is,
although sample diversity is high (Anubis has analyzed
more than 1 million unique malware samples so far),
there are nearly 100,000 malware “families”.


4.2 Malware Detection 


A number of heterogeneous techniques have been pre- 
sented to detect malware. 


Host-based Techniques Host-based techniques
were the first to be used to detect and stop malware
(think of antivirus software). Their main advantage
is that they can detect malware even before
it is actually executed. Approaches range from simple
byte-pattern matching, which scans a file for known
malicious strings or instructions [21], to model
checking [12] and compiler verification [5]. Unfortunately,
such (static) techniques can be evaded using packers
and polymorphism.

In an effort to overcome typical limitations of
matching-based approaches, Kolbitsch et al. [13]
introduced a new concept of signature based on
fine-grained models. Fine-grained models are graphs
representing the invocation order of system calls (and
other additional information) to match the
characteristic behavior of a given malware program.
The model generation is off-loaded onto a dynamic
malware analysis tool (i.e., Anubis). This approach
also allows the detection of unknown malware samples,
provided the “family” has been analyzed before.


Network-based Techniques Regarding specific 
network-based techniques, several approaches lever- 
age information extracted by analyzing network traf- 
fic [8, 9, 11, 16]. 

BotMiner [8] combines a number of different traffic
monitoring tools to extract network communication
patterns and their content. Typical information
that BotMiner takes into consideration includes
vertical and horizontal scans, exploit attempts, DNS
queries, and downloads of binaries. BotMiner then
clusters hosts with similar behavior and attempts to
detect botnet nodes. Although network-based approaches
could, in theory, allow on-the-fly detection, this
is hard to realize because they miss the activity
performed by malware on the host.


Techniques based on Data Mining Several
researchers address the detection of malware by using
data mining techniques, in an effort to detect a higher
number of malware samples that are simply a variant
of already known samples.

Tabish et al. [22] notice that most of the malware
samples submitted daily for analysis are not brand
new. Commonly, malware writers employ techniques such
as repacking to “obfuscate” malware content and thus
defeat approaches based on content matching, e.g.,
antivirus software. The authors devise an approach
based on extracting statistical and
information-theoretic features from file blocks. A
block is a fixed-size chunk of the byte-level
contents of a given file. More than 50 distinct
features are extracted, and then analyzed using
mathematical distance functions that are common in the
data mining field (e.g., the Manhattan and Chebyshev
distances). The approach gives good results in
general, but requires the analysis of several “good”
file samples, e.g., executables, PDF documents, etc.,
to detect malicious files.
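
For readers unfamiliar with the two distance functions named above, the following minimal sketch computes them for two feature vectors. It illustrates the definitions only; the vectors are made-up values, and nothing here reproduces the feature set or pipeline of Tabish et al.

```python
def manhattan_distance(u, v):
    """Sum of absolute coordinate differences (L1 distance)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def chebyshev_distance(u, v):
    """Largest absolute coordinate difference (L-infinity distance)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return max(abs(a - b) for a, b in zip(u, v))


# Made-up feature vectors for two file blocks (illustrative values only).
block_a = [1, 5, 2]
block_b = [4, 1, 2]
print(manhattan_distance(block_a, block_b))  # 7
print(chebyshev_distance(block_a, block_b))  # 4
```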


5 Conclusion 


In this paper we present Avatar, a new lightweight ar- 
chitecture for on-the-fly, signature-less malware anal- 
ysis, containment and detection for large networks. 

Avatar does not require any special setup or software
on the infected hosts. This is because the analysis
is not done on the allegedly infected host; it
is carried out on a remote system, which communicates
with the (allegedly) infected host through the
cuckoo’s egg. The cuckoo’s egg also provides
containment functionality. In fact, Avatar’s
architecture is completely centralized. This allows
one to deploy it in any environment (like a corporate
network) where the firewall can be modified to provide
the facilities needed for intercepting suspicious
downloads and injecting the cuckoo’s egg. In short,
Avatar can be deployed in most work environments with
very little effort. An additional advantage of a
centralized architecture is that updates to the
analysis engine affect only one machine, as opposed to
what happens, e.g., with antivirus software, where all
hosts have to be updated.

An interesting aspect of Avatar’s architecture is
that it can avoid some evasion techniques used by
malware; as we mentioned before, modern malware
can check whether it is running in a sandboxed
environment. Since our architecture does not deploy
any extra tool beforehand, not even at kernel level,
the malware has little chance of detecting that it is
under analysis.

The detection in Avatar is necessarily based on
heuristics, and is thus fallible. This, however, allows
the detection of malware for which there is no
signature available yet. On the other hand, since the
heuristics-based detection phase is always followed by
an analysis phase before proceeding to containment,
the risk of false positives in the detection phase is
heavily mitigated: if the analysis phase determines
that the suspected malware is actually a legitimate
program, the cuckoo’s egg can simply “release” it and
allow it to continue.

Our experiments show that our approach is effec- 
tive in detecting and containing malware, even un- 
known malicious code. We believe that Avatar can 
be the basis of an effective lightweight first line of 
defense against malware. 


Abstract


Botnets are a significant source of abusive messaging
(spam, phishing, etc.) and other types of malicious traffic.
A promising approach to help mitigate botnet-generated
traffic is signal analysis of transport-layer (i.e. TCP/IP)
characteristics, e.g. timing, packet reordering,
congestion, and flow control. Prior work [4] shows that
machine learning analysis of such traffic features on an
SMTP MTA can accurately differentiate between botnet
and legitimate sources. We make two contributions
toward the real-world deployment of such techniques: i)
an architecture for real-time on-line operation; and ii)
auto-learning of the unsupervised model across different
environments without human labeling (i.e. training).
We present a “SpamFlow” SpamAssassin plugin and the
requisite auxiliary daemons to integrate transport-layer
signal analysis with a popular open-source spam filter.
Using our system, we detail results from a production
deployment where our auto-learning technique achieves
better than 95 percent accuracy, precision, and recall
after reception of ~1,000 emails.


1 Introduction 


“Botnets” are distributed collections of compromised 
networked machines under common control [7]. Auto- 
mated methods scan, infect, or socially engineer vulner- 
able hosts in order to incorporate them into the botnet. 
Botnets provide a formidable computing and communi- 
cation platform by harnessing the power of thousands, 
or even millions, of nodes for a common collective pur- 
pose [21]. Unfortunately, that purpose is often malicious 
and economically or politically motivated. 

As one common use scenario, botnets account for
more than 85 percent of all abusive electronic mail
(including spam, phishing, malware, etc.) by one
estimate [14]. Botnet-based spamming campaigns are large
and long-lived [20], with more than 340,000 botnet hosts
involved in nearly 8,000 campaigns in one study [27].
The Messaging Anti-Abuse Working Group (MAAWG)
coalition of service providers reported that across 500M
monitored mailboxes in one quarter of 2007, 75 percent
of all messages (almost 400 billion) were spam [18].
A subsequent 2010 MAAWG study reports that the
situation has worsened: abusive messages accounted for 89
percent of all electronic mail in a representative sample
across many providers.

Abusive message traffic abounds on the Internet. This 
deluge of unwanted traffic is more than a mere nui- 
sance: a broad survey of large service providers finds 
that abusive messages account for the largest fraction of 
expended operational resources [1]. Despite extensive 
research and operational deployments, attackers and at- 
tacks have evolved at a rate faster than the Internet’s abil- 
ity to defend. There remains ample room for improve- 
ment of in-production botnet attribution and mitigation. 

One promising approach for mitigating botnet- 
generated abusive messaging is statistical traffic analy- 
sis. Prior work [4] shows that by using transport-layer 
traffic features, e.g. TCP retransmits, out-of-order pack- 
ets, delay, jitter, etc., one can reliably infer whether the 
source of an email SMTP [16] flow is legitimate or orig- 
inating from a member of a botnet. Botnets must send 
large volumes of abusive messages to remain financially 
viable. Because bots are frequently attached via asym- 
metric (low upload bandwidth) residential connections, 
they necessarily congest their local uplink — an effect 
that is remotely detectable. Perhaps most importantly, 
transport-layer classifiers are content (e.g. the words of 
the message itself) and IP reputation (e.g. blocklist) ag- 
nostic, facilitating privacy-preserving deployment even 
within the network core. Deployed on individual Mail 
Transport Agents (MTAs), such techniques can permit 
early-rejection of messages before application delivery, 
significantly reducing system load. 

Thus far, research in transport-layer classification has
been offline, where experimental data is examined a
posteriori. In this paper, we present the system engineering
efforts required to integrate TCP transport features into
the classification decisions of the popular open-source
SpamAssassin [17] spam filter. A crucial obstacle to
realizing such techniques is the need to adequately train
and build a model of normal and abusive traffic across
a variety of operational environments. Rather than
requiring human labeling or overly general models to build
ground truth, we exploit the auto-learning functionality
of SpamAssassin. Our primary contributions include:


1. On-line and real-time transport-layer classification 
of live email messages on a production MTA. 


2. Auto-learning of transport features to automatically 
learn the unsupervised model across different oper- 
ating environments without human training. 


The remainder of this paper describes related work
(§2). From this foundation, we describe our system
architecture and testing methodology (§3). We present
production deployment results in §4 and discuss their
implications (§5). We conclude by outlining future work.


2 Related Work 


Recent research efforts have shown great promise in
understanding the character and behavior of botnets. While
these proposed solutions are currently effective, they fre- 
quently rely on brittle heuristics and unreliable indica- 
tors. For instance, Xie et al. provide a system [27] 
to identify and characterize botnets using an automatic 
technique based on discerning spam URLs in email. 
Other research relies on IP addresses as indicators [29]. 
However, malicious botnet IP addresses are highly dy- 
namic as new hosts are compromised, existing hosts re- 
ceive new DHCP leases, or sources are spoofed [3]. In- 
deed, “fresh” IP addresses, i.e. those not in real-time 
blocklists, are a valuable commodity. Similarly, DNS is 
a poor identifier of malicious hosts given the prevalence 
of botnets employing DNS fast-flux [5] techniques to dis- 
tribute load among redirectors, survive node failures, and 
obfuscate back-end hosting infrastructure. 

A large body of work examines network-layer (IP)
properties of botnets. Ramachandran et al. [22]
characterize spamming behavior by correlating data collected
from three sources: a sinkhole, a large e-mail provider,
and the command and control of a Bobax botnet. By
focusing on network-level properties including: i) the IP
address space from which spam originates; ii) the
autonomous system (AS) that sent spam messages to their
sinkhole; and iii) BGP route announcements, they show
that spam and legitimate e-mail originate from the same
portion of the IP address space. Thus, IP addresses are
not a reliable indicator of malicious or abusive nodes.

Subsequent work from Hao et al. [11] demonstrates
that AS alone as a feature may cause a large rate of 
false positives. They achieve better results by extracting 
lightweight features from network-level properties such 
as geodesic distance between sender and receiver, sender 
IP neighborhood density, probability ratio of spam to 
ham at the time of day the message arrives, the AS num- 
ber of the sender, and the status of open ports on the 
sender machine. Further studies [15, 28] have shown 
that a spammer can evade such techniques by advertis- 
ing routes from a forged AS number [11]. 

Schatzmann et al. [24] similarly focus on network- 
level characteristics of spammers, but from the perspec- 
tive of an AS or service provider. Their idea is to pas- 
sively collect the aggregate decisions of a large num- 
ber of e-mail servers that perform some level of pre- 
filtering (e.g. blocklisting). Using passive flow collec- 
tion to gather byte, packet, and packet size counts, this 
aggregated knowledge can enhance spam mitigation. 

Commercial vendors expend considerable effort divid- 
ing the Internet IP address space into regions, with partic- 
ular attention given to identifying residential broadband 
addresses. By discriminating against residential hosts, 
the hope is to block traffic from nodes that should not be 
sourcing email in the first place. This approach is
brittle and also raises architectural misgivings, since it
arbitrarily discriminates against classes of users without
prior provocation. Such residential blocking may have
implications for notions of network neutrality as
neutrality legislation catches up with technology.

In contrast to these spam detection and mitigation 
techniques, Beverly and Sollins [4] present a content and 
IP reputation agnostic scheme based on statistical sig- 
nal analysis of the transport (TCP) traffic stream. The 
premise is that spammers must send large volumes of e- 
mail to be effective, causing constituent network links 
to experience contention and congestion. Such conges- 
tion effects are particularly prominent for many botnet 
hosts which reside on residential broadband connections 
where there are large gateway buffers [12] and asym- 
metric bandwidth. Transport-layer properties such as the 
number of lost segments and round trip time (RTT) there- 
fore exhibit different distributions, permitting discrimi- 
nation between spam and legitimate behavior. Among 
many TCP features, their analysis found that RTT and 
minimum-congestion window are the most discrimina- 
tory. This transport-only classifier exhibits more than 90 
percent accuracy and precision on their data. 

Follow-on work to [4] explores similar ideas, including
the use of lightweight single-TCP/SYN passive operating
system signatures at the router level [10]. Ouyang
et al. [19] conduct a large-scale empirical analysis of
transport-layer characteristics on over 600,000 messages.
Among tested features, their analysis similarly finds the
three-way-handshake latency, time-to-live (TTL), and
inter-packet idle time and variance most discriminating
for ham versus spam. These features remain stable over
time, yielding 85-92 percent classification accuracy.

Based on the encouraging results of this body of prior 
work, we endeavor to take a step toward the real-world 
deployment of transport-classifier based botnet detection 
and abusive traffic mitigation techniques. 


3 System Architecture 


The TCP/IP network stack logically divides functional- 
ity between layers. As a result, applications do not nor- 
mally have access to lower-layer features. For example, 
TCP (implemented in the kernel or lower) provides an 
abstraction of a reliable and in-order data stream to the 
application via a socket interface. Applications are re- 
moved from the details of packet arrival timing, order- 
ing, TTL, etc. Thus, our design must collect, on a per- 
message basis, transport-layer traffic characteristics and 
expose them up the stack to the SpamFlow (SF) plugin. 
This section describes our system architecture and the in- 
teraction between various components: SpamAssassin, 
SpamFlow, and the SpamFlow plugin. 


3.1 Overview 


We start with an overview of our SpamFlow system
architecture, shown in Figure 1. For clarity of exposition,
we describe all functionality as being co-located with the
Mail Transport Agent (MTA); however, the components
can easily be distributed across different machines. The
system comprises four main components: SpamAssassin,
the SpamFlow traffic feature extraction engine,
the SpamFlow plugin, and the classification software,
referred to as SpamAssassin, SpamFlow, SF plugin, and
classifier respectively.

Every message received by the MTA is processed by 
SpamAssassin and then piped to the plugin. Simulta- 
neously, SpamFlow continuously and promiscuously lis- 
tens on the network interface, capturing SMTP packets 
via the pcap API [13], aggregating packets into flows, 
and computing the relevant traffic statistics (e.g. TCP re- 
transmits, out-of-order packets, delay, jitter, etc.). The 
plugin queries SpamFlow with the message’s identifier 
in order to retrieve the flow-level transport features cor- 
responding to that message. Next, the plugin sends the 
message’s transport feature vector to the classifier. In re- 
sponse, the classifier returns a binary or probabilistic pre- 
diction (depending on the classifier employed) that then 
influences the final score of the message, and hence the 
final disposition. We describe each component in more 
detail in the following subsections. 

Figure 1: SpamFlow system architecture: transport-layer
features are aggregated on a per-flow basis. The
SpamFlow SpamAssassin plugin uses XML-RPC to obtain
each message’s feature vector, which is then sent to the
classifier. Predictions are relayed to the plugin and
integrated into the final SpamAssassin message score.


3.2 SpamAssassin 


SpamAssassin [17] is an open-source, rule and content 
learning-based spam filter. Each rule is assigned a weight 
by a perceptron algorithm [25] and then the weighted 
scores are summed to produce an overall score for each 
message. The classification process involves comparing 
the overall score with a user-defined threshold t (which 
defaults to a value that maximized performance on a 
broadly representative training sample). If the score is 
above t, then the message is classified as spam; other- 
wise, as legitimate. SpamAssassin is modular and exten- 
sible for adding other filtering techniques. Popular plu- 
gins include real-time block lists (RBLs), domain-keys, 
permit lists, collaborative filtering, learning-based tech- 
niques (e.g. naive Bayes), and others. 
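
The scoring step described above amounts to a weighted sum followed by a threshold comparison. The sketch below shows that logic in isolation; the rule weights are invented for illustration and are not SpamAssassin's shipped scores, and the default threshold of 5.0 is taken from the `required=5.0` value visible in Figure 4.

```python
# Hypothetical rule weights; real SpamAssassin scores differ.
RULE_WEIGHTS = {
    "BAYES_50": 1.5,
    "RCVD_IN_XBL": 2.5,
    "HTML_MESSAGE": 0.5,
    "SPAMFLOW": 2.0,
}


def overall_score(fired_rules, weights=RULE_WEIGHTS):
    """Sum the weights of every rule that matched the message."""
    return sum(weights[rule] for rule in fired_rules)


def classify_message(fired_rules, threshold=5.0, weights=RULE_WEIGHTS):
    """Return 'spam' when the summed score is above the threshold t."""
    score = overall_score(fired_rules, weights)
    return "spam" if score > threshold else "legitimate"


print(classify_message(["BAYES_50", "RCVD_IN_XBL", "HTML_MESSAGE", "SPAMFLOW"]))
print(classify_message(["HTML_MESSAGE"]))
```

Because every plugin contributes through the same sum, a transport-layer rule such as SPAMFLOW can shift borderline messages across the threshold without replacing the content-based rules.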


Furthermore, SpamAssassin features a threshold-based
mode in which new exemplar emails trigger an
automatic retraining process. While SpamAssassin refers
to this retraining as “auto-learning,” this is typically
known as “online” or “iterative” learning in machine
learning. The primary difference is that advanced
iterative learning approaches modify the classification model
to account for new emails, whereas in auto-learning the
entire model is rebuilt each time. In SpamAssassin
auto-learning, a previously unseen message is used to retrain
the model if it receives a score greater than an upper
threshold (assumed spam) or less than a lower threshold
(assumed non-spam). For example, when a message’s score
falls outside these threshold values, SpamAssassin
rebuilds the model of the built-in naive Bayes
classifier, and classifies subsequent messages with the
newly updated model.
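
The auto-learning rule can be summarised as a two-sided threshold test on the overall message score. The sketch below is a simplified illustration; the two threshold values are placeholders chosen for the example, and the function name is hypothetical.

```python
def autolearn_label(score, spam_threshold=12.0, ham_threshold=0.1):
    """Return the label to train on, or None if the score is inconclusive.

    The threshold defaults are illustrative placeholders.
    """
    if score > spam_threshold:
        return "spam"   # confidently spam: use as a spam exemplar
    if score < ham_threshold:
        return "ham"    # confidently legitimate: use as a non-spam exemplar
    return None         # in between: do not retrain on this message


# Only messages with a confident score become training exemplars.
scores = [15.2, 6.9, -1.0, 3.4]
training_set = [(s, autolearn_label(s)) for s in scores if autolearn_label(s)]
print(training_set)  # [(15.2, 'spam'), (-1.0, 'ham')]
```

Messages scoring between the two thresholds are still classified, but they are not trusted enough to serve as training labels.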

--- src/smtpd/smtpd.c.orig
+++ src/smtpd/smtpd.c
@@ -2807,9 +2807,9 @@
      */
     if (!proxy || state->xforward.flags == 0) {
         out_fprintf(out_stream, REC_TYPE_NORM,
-                    "Received: from %s (%s [%s])",
+                    "Received: from %s (%s [%s:%s])",
                     state->helo_name ? state->helo_name : state->name,
-                    state->name, state->rfc_addr);
+                    state->name, state->rfc_addr, state->port);


Figure 2: Postfix modification to support traffic identifiers


3.3. SpamFlow 


SpamFlow [4] is our network analyzer. Using libpcap 
[13], SpamFlow promiscuously listens on the network 
interface and builds source host/port flows (the destina- 
tion MTA address is constant and known and thus not 
part of the flow tuple). As SMTP flows complete, either 
via an explicit TCP termination handshake or via time- 
out, SpamFlow extracts transport-layer features for each 
as detailed extensively in [4]. SpamFlow listens for XML 
queries for a particular flow’s IP and port, responding in 
kind with the features for that flow. 
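
To make the flow-building step concrete, the sketch below folds observed TCP data segments into per-flow counters keyed by (source IP, source port). It is a deliberately simplified illustration, not SpamFlow's implementation (which is written in C++ and extracts the features detailed in [4]); the class, the function, and the retransmit and reordering heuristics are our own assumptions, and sequence-number wraparound is ignored.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FlowStats:
    """Per-flow counters; a small, simplified subset of possible features."""
    packets: int = 0
    retransmits: int = 0
    out_of_order: int = 0
    highest_seq: int = -1
    seen_seqs: set = field(default_factory=set)


def update_flow(flows, src_ip, src_port, seq):
    """Fold one data segment into the flow keyed by (source IP, source port)."""
    stats = flows[(src_ip, src_port)]
    stats.packets += 1
    if seq in stats.seen_seqs:
        stats.retransmits += 1      # same sequence number seen again
    elif seq < stats.highest_seq:
        stats.out_of_order += 1     # new segment that arrived late
    stats.seen_seqs.add(seq)
    stats.highest_seq = max(stats.highest_seq, seq)
    return stats


flows = defaultdict(FlowStats)
for seq in (1000, 2000, 4000, 3000, 2000):
    update_flow(flows, "77.239.18.226", 37689, seq)

stats = flows[("77.239.18.226", 37689)]
print(stats.packets, stats.retransmits, stats.out_of_order)  # 5 1 1
```

The destination address is left out of the key for the reason given above: it is always the MTA itself.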


We explored two options for uniquely identifying mes- 
sages to correlate between messages and their constituent 
flow data. First, every message contains a unique mes- 
sage string (“Message-ID” in the header) [23] to facil- 
itate replies, threading, etc. Using deep packet inspec- 
tion, SpamFlow could reassemble email messages from 
the packet payloads to uniquely identify each flow by 
Message-ID. The immediate downside to using the mes- 
sage identification field is that doing so removes the ben- 
efit of only examining packet header statistics: namely 
privacy and efficiency. 


Instead, we opt to follow a simpler approach and use
the remote host IP address and ephemeral port number as
the message identifier. These fields are readily available
without any transport reassembly and are, in general,
unique. Naturally, IP address and port tuples are reused
(there is a maximum of only 2^16 unique TCP client-side
ephemeral ports). For a tuple collision to occur in
SpamFlow, two identical flows must arrive within less time
than the messages can be delivered to the MTA and
processed by SpamAssassin, i.e. on the order of a few
seconds. Not only would this violate the TCP time-wait
procedure, but we also do not observe any duplicate flows
within such short time periods in our empirical data.


The final detail is how to expose the message identifier
to the plugin so it can query SpamFlow. We modify our
MTA server to add the (IP_address, TCP_port)
identification tuple of the remote MTA to the header of each
incoming e-mail. The actual MTA code modifications are
small and straightforward. For reference we provide the
code changes for the popular Postfix and qmail MTAs in
Figures 2 and 3.


--- received.c.orig
+++ received.c
@@ -44,2 +44,3 @@
 char *remoteip;
+char *remoteport;
 char *remotehost;
@@ -63,2 +64,5 @@
 safeput(qqt,remoteip);
+ remoteport = getenv("TCPREMOTEPORT");
+ qmail_puts(qqt,":");
+ safeput(qqt,remoteport);
 qmail_puts(qqt,")\n by ");


Figure 3: qmail modification to support traffic identifiers
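
On the consuming side, the identifier written by the patched MTA has to be read back out of the Received header. The sketch below shows one way this could be done; it is a Python stand-in for illustration (the actual plugin is written in Perl), the function name and regular expression are our own, and only IPv4 addresses are handled.

```python
import re

# Matches "addr:port" wrapped in () as written by the qmail patch,
# or in [] as written by the Postfix patch. IPv4 only.
_ID_PATTERN = re.compile(r"[\[(](\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})[\])]")


def extract_flow_id(received_header):
    """Return (ip, port) from a patched Received header, or None if absent."""
    match = _ID_PATTERN.search(received_header)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


qmail_style = "Received: from cm-static-18-226.telekabel.ba (77.239.18.226:37689)"
postfix_style = "Received: from mail.example.org (mail.example.org [192.0.2.10:54321])"
print(extract_flow_id(qmail_style))    # ('77.239.18.226', 37689)
print(extract_flow_id(postfix_style))  # ('192.0.2.10', 54321)
```

A header written by an unpatched MTA carries no port, so the function returns `None` and the message would simply be scored without transport features.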


3.4 SpamFlow Plugin 


SpamFlow does not operate as a standalone MTA or
spam classifier. Therefore, we integrate it with an
existing one. We select SpamAssassin [17] because it is
open source and widely used; for instance, the
commercial Barracuda [2] network appliance is based on
SpamAssassin. Importantly, SpamAssassin employs a
modular architecture that allows developers to extend its
functionality through plugins. As SpamAssassin is written
in Perl, we develop a small, lightweight SpamAssassin
Perl plugin tying the various components of Figure 1
together. In real time, as e-mail messages are routed
through the SpamFlow plugin, it scores them using a
previously learned model of transport features. This score,
in combination with the scores from other rules, provides
a final message disposition.

The plugin acts as the controller of the system and 
binds the traffic analysis engine and the classifier to- 
gether. First, the plugin provides SpamFlow with the 2- 
tuple identifier of the message under inspection and re- 
ceives in return the corresponding message’s transport- 
layer features. After obtaining the features, the plugin 
passes them to a logically distinct machine learning clas- 
sifier and retrieves the corresponding prediction. Fig- 
ure 4 shows an example where the MTA added the mes- 
sage identifier (here, 77.239.18.226:37689) and the 
plugin attached SpamFlow’s transport feature vector to 
the message’s headers. 
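
The X-Spam-Spamflow-Tag header in Figure 4 can be split back into an identifier and a feature vector. The sketch below does this for the tag shown in the figure. One point is an inference rather than something the authors state: the leading integer appears to be the remote IPv4 address stored as a 32-bit value in little-endian byte order, since decoding it that way reproduces 77.239.18.226. The meaning of the individual features is not given in this section, so they are left unnamed.

```python
import socket
import struct


def parse_spamflow_tag(tag):
    """Split a tag of the form '<ip_int>:<port>,<f1>,<f2>,...' into its parts."""
    identifier, *raw_features = [part.strip() for part in tag.split(",")]
    ip_int, port = identifier.split(":")
    # Assumption: the integer is the IPv4 address in little-endian byte order.
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_int)))
    features = [float(value) for value in raw_features]
    return ip, int(port), features


tag = ("3792891725:37689,12,10,0,0,0,0,1,1,0,53248,34.464852,0.162818,"
       "120.441156,148.297699,51.891697,5840,48,1,64")
ip, port, features = parse_spamflow_tag(tag)
print(ip, port, len(features))  # 77.239.18.226 37689 19
```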

From Josephine@rsi.com Tue Feb 01 23:21:58 2011
Return-Path: <Josephine@rsi.com>
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on ralph.rbeverly.net
X-Spam-Level: ******
X-Spam-Status: Yes, score=6.9 required=5.0 tests=BAYES_50,RCVD_IN_XBL,HTML_MESSAGE,
    SPAMFLOW,UNPARSEABLE_RELAY autolearn=no version=3.3.1
X-Spam-Spamflow-Tag: 3792891725:37689,12,10,0,0,0,0,1,1,0,53248,34.464852,0.162818,
    120.441156,148.297699,51.891697,5840,48,1,64
Received: (qmail 30920 invoked from network); 1 Feb 2011 23:21:57 -0000
Received: from cm-static-18-226.telekabel.ba (77.239.18.226:37689)
Received: from vdhvjcvivjvbwyhscvfwq (192.168.1.185) by bluebellgroup.com (77.239.18.226) with Microsoft SMTP
Message-ID: <4D489025.504060@etisbew.com>
Date: Wed, 2 Feb 2011 00:20:48 +0100
From: Essie <Essie@Ghermes.com>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12)


Figure 4: Example email message headers with transport features added by the SpamFlow system


Between components, we use XML-RPC [26] to
communicate. XML-RPC is a simple protocol that allows
communication between procedures running in different
applications or machines. Specifically, the client uses an
HTTP POST request to pass data to the server; the server
in return sends an HTTP response. In our implementation,
we register the classifier with a classify procedure
that takes the features as input. Thus, the plugin
sends the HTTP POST request with the name of the
classify procedure along with the features, as
comma-separated values forming a string¹, and receives via HTTP
response the classification prediction from the classifier.

Not only is XML-RPC simple and standardized, it 
allows the classifier to potentially operate on a differ- 
ent machine from SpamFlow, which in the future could 
allow the XML-RPC classifier to serve many Spam- 
Flow instances in a multi-threaded fashion and distribute 
load. Further, all popular programming languages pro- 
vide XML-RPC APIs, notably allowing us to use our 
language of choice for the various tasks. In our specific 
implementation, we develop SpamFlow in C++ while the 
classifier is a Python daemon. 
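
The exchange described above can be sketched with Python's standard XML-RPC modules. This is a minimal illustration, not the SpamFlow code: the decision rule inside `classify`, the helper names `start_classifier` and `request_prediction`, and the loopback address are all assumptions made for the example. Only the procedure name `classify` and the comma-separated feature string come from the text.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer


def classify(features_csv):
    """Parse the comma-separated feature string and return a label.

    The rule below is a hypothetical placeholder standing in for a
    trained model; it is not the classifier used in the paper.
    """
    features = [float(value) for value in features_csv.split(",")]
    return "spam" if features[0] > 5 else "ham"


def start_classifier(host="127.0.0.1", port=0):
    """Serve `classify` over XML-RPC in a background thread.

    Port 0 asks the OS for a free port; the chosen port is returned.
    """
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(classify, "classify")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, server.server_address[1]


def request_prediction(port, features_csv, host="127.0.0.1"):
    """Client side (the plugin's role): POST the CSV string, get the label."""
    proxy = xmlrpc.client.ServerProxy(f"http://{host}:{port}/")
    return proxy.classify(features_csv)
```

Because the client only needs a URL, the same call works unchanged if the classifier is moved to another machine.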


3.5 Classification Engine 


The final component of the system architecture is the 
traffic classification engine which we implement using 
the open source Orange [9] machine learning and data 
mining Python package. While the details of the ma- 
chine learning algorithms are out of scope for this paper, 
we note that Orange includes a variety of algorithms and 
statistical modules for performance evaluation. 

Our classifier implementation experiments with three 
machine-learning algorithms: naive Bayes, decision 
trees (specifically, the C4.5 algorithm), and support vec- 
tor machines (SVM). These three algorithms are broadly 
representative of different classes of learning strategies 
and allow us to evaluate system classification performance,
generality, and system speed.


4 Results 


This section first describes results from load testing the
SpamFlow system in a controlled laboratory environment in
order to understand its practical feasibility. We then detail
performance results using auto-learning of transport features
in a live production environment.

¹The CSV string is used for expediency; in the future, we plan to
use individual XML identifiers for each feature.


4.1 Load Testing 


To understand the system-level performance of our 
SpamFlow design as outlined in §3, we create the controlled
testing environment depicted in Figure 5. One host runs the
SpamFlow system and is physically connected to a second
traffic sourcing host. The traffic sourcing host implements
our custom e-mail “replayer” application and a modified
Dummynet [6] network emulator.
The replayer reads from the TREC public email cor- 
pus [8] of 92,187 messages, of which 52,788 are spam 
and 39,399 are legitimate. For each message, the re- 
player: 1) extracts the headers and adds as recipient a 
valid user of our virtual-network domain; 2) establishes 
an SMTP session with the MTA (Postfix) of the Spam- 
Flow system under test; 3) sets the differentiated services 
code point (DSCP) in the IP header of each message ac- 
cording to the ground truth label (spam or ham); 4) uses 
the standard SMTP protocol to transmit the message. 
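
Step 3 of the replayer, marking each connection with a DSCP value, might look like the following sketch. The DSCP values in `DSCP_BY_LABEL` and the function names are hypothetical, since the text does not give the code points used in the testbed. The only fixed fact is that DSCP occupies the upper six bits of the IPv4 TOS byte.

```python
import socket

# Hypothetical code points for the two ground-truth classes; the
# values used in the actual testbed are not stated in the text.
DSCP_BY_LABEL = {"spam": 8, "ham": 0}


def dscp_to_tos(dscp):
    """Shift a six-bit DSCP value into the upper bits of the TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in six bits")
    return dscp << 2


def mark_socket(sock, label):
    """Set the TOS byte on a socket before the SMTP session starts."""
    tos = dscp_to_tos(DSCP_BY_LABEL[label])
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
    return tos
```

A replayer would call `mark_socket` on the TCP socket before connecting to the MTA, so that every packet of that session carries the label for the emulator to act on.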
We set the DSCP differently for spam and non-spam 
messages in order to influence the emulated network be- 
havior. Our goal is to coarsely simulate the character- 
istics that botnet-generated spam traffic exhibits, such 
as TCP timeouts, retransmissions, resets, and highly 
variable RTT estimates. For our evaluation, we select 
Dummynet [6], a publicly-available tool that enables in- 
troduction of delay, loss, bandwidth and queuing con- 
straints, etc. for packets passing through virtual network 
links. In our testing setup, Dummynet applies differ- 
ent queuing, scheduling, bandwidth, delay, loss, etc. de- 
pending on the DSCP bits which correspond to email 
type. Dummynet emulates only a fixed propagation delay. We
therefore modify it to generate random delays drawn from a
normal distribution with a mean delay of μ = 150 ms and a
standard deviation of σ = 50 ms for spam traffic that
originates from the replayer, and a delay of μ = 40 ms with
σ = 25 ms for legitimate traffic in both directions. We
introduce delay in legitimate traffic in order to avoid
overfitting our learned statistical model. These delays need
not be precise as they are intended merely to mimic a
congested environment. To emulate timeouts, retransmissions,
and resets, we apply a random-packet-drop policy on the
Dummynet pipe.

Figure 5: Laboratory testing environment: enabling tightly
controlled and easily configurable repeatable experiments.
The replayer application replays an email corpus while
Dummynet emulates different network behavior to mimic botnet
and legitimate traffic. Using the testbed, we load test and
debug SpamFlow.
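
The delay and loss model described above can be approximated in a few lines. This is a sketch of the sampling logic only, not the Dummynet modification itself. Clamping negative draws to zero and the function names are assumptions, as the text does not say how negative samples are handled.

```python
import random

# (mean, standard deviation) in milliseconds, taken from the text.
DELAY_MS = {"spam": (150.0, 50.0), "ham": (40.0, 25.0)}


def sample_delay_ms(label, rng=random):
    """Draw one delay from the class's normal distribution.

    Negative draws are clamped to zero (an assumption).
    """
    mean, stddev = DELAY_MS[label]
    return max(0.0, rng.gauss(mean, stddev))


def should_drop(loss_rate, rng=random):
    """Independent random drop with probability loss_rate."""
    return rng.random() < loss_rate
```

Each drop decision is independent of the others, which matches the limitation the authors acknowledge: real queues produce correlated losses.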


Note that we disable all SpamAssassin rules requiring 
network access, e.g. real-time blocklists, as such rules 
are dynamic and thus sensitive to dates and time. 


While we recognize that our modifications to Dum- 
mynet only partially emulate a congested network (for 
example, loss events are independent — an assumption 
that does not hold true in a real queue), our goal in the 
emulation environment is to enable reproducible testing. 
Thus, we use the environment to emulate high-rate traffic 
and evaluate performance, throughput, system load, etc. 
on representative traffic. Section 4.3 goes on to detail 
real-world performance on live production traffic. 


Table 1 shows the performance of the three classifiers
with respect to training time. C4.5 has the smallest train- 
ing time. SVM, on the other hand, has the largest training 
time, due to the more complex decision model. 


We then examine throughput: the rate at which the 
system is able to classify and process emails from the 
replayer. Naive Bayes, C4.5, and SVM achieve 1,300, 
1,000, and 700 messages per second throughput respec- 
tively in our environment. Naive Bayes provides the 
highest throughput, likely due to its simple decision rule. 


Many factors impact throughput; our intent is to un- 
derstand the relative performance of each classifier and 
to establish real-world feasibility. The takeaway from 
these measurements is that, taking into account the rel- 
ative independence of our system from the classification 
method, we can select the classification model that fits 
our needs. For example, the low training time of C4.5 
makes it a good candidate when we need to retrain often. 
Table 1: SpamFlow training time (sec) as a function of
classifier type and sample size

Classifier | Training Samples (four sample sizes)
C4.5       | 0.15 | 0.96 | 16.02 | 29.80





4.2 Production Environment 


Live testing is important because it reveals how the sys- 
tem interacts with possibly unknown features of the ex- 
ternal environment. We deployed our system in a live 
environment at our university for a small domain from 
January 25, 2011 to March 2, 2011 and collected a trace 
of 5,926 e-mail messages. 

Ground truth was first established using an unmodified 
SpamAssassin version 3.3.1 instance without transport- 
layer traffic features, i.e. with only the default built-in 
rules and content analysis. We then manually examined 
all the legitimate ham messages and relabeled those that 
were false negatives. We manually sampled the spam 
messages to eliminate false positives and establish rea- 
sonable ground truth. While the volume of traffic cap- 
tured is small, our intent in this experiment is to establish 
the ability to auto-learn the transport-layer features in a 
production environment and ascertain the resulting clas- 
sification performance. We envision larger-scale, higher- 
volume live testing in the future. 

Auto-learning is the incremental process of building 
the classification model based on exemplar e-mail mes- 
sages whose scores exceed certain threshold values. In 
our case, we use features of e-mail messages otherwise 
classified via orthogonal methods as having very high or 
very low scores (for instance, those emails whose content 
triggers many of SpamAssassin’s rule-based indicators). 
Specifically, we explicitly retrain the classifier’s model 
each time a new message obtains an especially high or 
low score from the other SpamAssassin methods (rule- 
and Bayesian-word based); i.e. a score above or below 
set thresholds. After retraining is complete, we evalu- 
ate performance iteratively on subsequent messages un- 
til a new message arrives with a score above or below the 
threshold, triggering retraining again. 
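
The auto-learning loop can be sketched as follows. To keep the example self-contained it uses a toy nearest-centroid classifier rather than the naive Bayes, C4.5, or SVM models from the paper. The function names, the inclusive threshold comparisons, and the default threshold values are assumptions made for illustration.

```python
def train_centroids(examples):
    """Return per-class mean feature vectors from (features, label) pairs."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def predict(centroids, features):
    """Label of the nearest centroid (squared Euclidean distance)."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], features))
    return min(centroids, key=dist)


def auto_learn(messages, spam_threshold=16.0, ham_threshold=1.0):
    """Process (features, sa_score) pairs in arrival order.

    Messages scoring at or beyond a threshold become exemplars and
    trigger retraining; all others are classified with the current
    model. A prediction is None until both classes have an exemplar.
    """
    training, centroids, predictions = [], {}, []
    for features, sa_score in messages:
        if sa_score >= spam_threshold:
            training.append((features, "spam"))
        elif sa_score <= ham_threshold:
            training.append((features, "ham"))
        else:
            ready = len(centroids) == 2
            predictions.append(predict(centroids, features) if ready else None)
            continue
        centroids = train_centroids(training)  # retrain on every exemplar
    return predictions
```

Retraining from scratch on every exemplar is the simplest reading of the text, and it is why the low training time of a model matters when exemplars arrive often.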

Our threshold selection is based on empirical spam and ham
SpamAssassin score distributions. Spam message scores follow
a normal distribution with μ = 16.3 and σ = 7.7, whereas
scores of legitimate messages have a mean of μ = 1.3, but are
skewed left. Therefore, we first experiment with a spam
threshold τ⁺ = 16 and a ham threshold τ⁻ = 1, which allows
the classifiers to be trained on an approximately even
fraction of training and test examples: a total of
2,683/5,590 (48.0%) spam and 296/436 (67.9%) ham messages.

We canonically call spam a “positive” and ham a “neg- 
ative” to indicate disposition. Correct predictions result 
in either a true positive (tp) or true negative (tn). A 
spam message that is mispredicted as ham produces a 
false negative (fn), while a ham message misclassified 
as spam produces a false positive (fp). Note that false 
positives in email filtering are particularly expensive for 
users as there is a high cost to missing or discarding le- 
gitimate messages. As performance metrics, we consider 
accuracy, precision, recall, specificity, and F-score: 


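\[ \text{accuracy} = \frac{tp + tn}{tp + fp + tn + fn} \tag{1} \]

\[ \text{precision} = \frac{tp}{tp + fp} \tag{2} \]

\[ \text{recall} = \frac{tp}{tp + fn} \tag{3} \]

\[ \text{specificity} = \frac{tn}{fp + tn} \tag{4} \]

\[ \text{F-score} = 2 \left( \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \right) \tag{5} \]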


All of these metrics are important to consider to prop- 
erly understand system performance. For instance, accu- 
racy is misleading if the underlying class prior is heavily 
skewed: if 95% of the messages are in fact spam, then a 
deterministic classifier that always predicts “spam” will
achieve seemingly high 95% accuracy without any learning.
Precision therefore measures, among messages predicted to be
spam, the fraction that are truly spam. Recall measures the
influence of misclassified spam messages, i.e., is a metric
of the classifier’s ability to detect
spam. Specificity, or true negative rate, determines how 
well the classifier is differentiating between false posi- 
tives and true negatives. Finally, because there is a nat- 
ural tension between achieving high precision and high 
recall, a common metric is F-Score which is simply the 
harmonic mean of precision and recall. 
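
As a worked illustration, the five metrics can be computed from the confusion counts as below. The function name and the convention of returning 0 for an undefined ratio are choices made for this sketch. The example call reproduces the "always predict spam" case from the text, using 95 spam and 5 ham messages.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the five metrics from confusion counts (spam = positive)."""
    def ratio(num, den):
        return num / den if den else 0.0  # convention: 0 when undefined

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, fp + tn),
        "f_score": ratio(2 * precision * recall, precision + recall),
    }


# A classifier that labels everything as spam: 95 spam, 5 ham.
always_spam = classification_metrics(tp=95, fp=5, tn=0, fn=0)
```

Here accuracy is 0.95 and recall is 1.0, yet specificity is 0, since every ham message is a false positive. This is why accuracy alone is not sufficient.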


4.3 Production Testing 


Figure 6 shows the classification performance metrics 
of the three classifiers we implement in SpamFlow as a 
function of cumulative training samples received. Fig- 
ure 6 therefore depicts the classifiers’ auto-learning over 
time as new exemplar training messages are received. 
Figure 6(a) displays cumulative accuracy for each 
classifier over time and includes the spam prior. The 
spam prior is simply the fraction of all training emails 
that are spam. A naive classifier could simply predict the 
prior, so values above the prior indicate true learning. We 
observe both decision trees and SVMs providing greater 
than 95 percent accuracy. Figure 6(c) similarly shows de- 
cision tree and SVM providing high F-scores, indicative 


of very good performance using only transport-layer fea- 
tures. Of note is that this level of performance is achieved 
after receiving only 100-200 messages. The weakness 
in SpamFlow only using traffic characteristics appears in 
the specificity, Figure 6(e), where false positives drive 
our best specificity down to approximately 75 percent. 


To better understand the sensitivity of our auto-learning
results to the imposed thresholds τ, we experiment with a
spam threshold roughly two standard deviations above the
mean: τ⁺ = 30. By increasing the spam threshold, the SpamFlow
auto-learning uses fewer spam-training examples. However, we
expect to have higher confidence in their true disposition of
spam with the higher threshold. Important to our evaluation,
τ⁺ = 30 has the effect of balancing the training complexion
so that there is not a strong class prior: 227 exemplar spam
messages and 296 exemplar ham messages.


With the spam score threshold raised to τ⁺ = 30, Figure
6(b) shows that the spam prior is now close to 50
percent, removing any training class bias. SVM and 
naive Bayes still achieve greater than 90 percent accu- 
racy. Again, clearly the auto-learning behavior is work- 
ing with performance steadily increasing over time and 
greatly outperforming the spam prior. As with the lower 
threshold, Figure 6(d) demonstrates very high F-Scores 
for all of the classifiers. 


Figure 6(f) highlights the challenge in false positives. 
However, the most specific classifier, the decision tree 
algorithm, is also highly accurate and precise. With 
machine learning there is an inherent trade-off between
achieving very high true positive rates and keeping false 
positive rates low. Our results demonstrate the best com- 
promise with the higher auto-learning threshold and the 
use of decision trees. 


Finally, we perform an initial investigation into 
whether the combined votes of SpamAssassin and Spam- 
Flow lead to overall improved performance. We experi- 
ment with adding 0.2 (experiment 1) and with adding 1.0 
(experiment 2) to the final score if SpamFlow predicts 
a spam message on the basis of transport traffic charac- 
teristics. Otherwise, we subtract 1.0 from the final score. 
This crude weighting does not leverage SpamFlow’s con- 
fidence in the prediction, and does not properly weight 
the vote in accordance with SpamAssassin’s other rules. 
We leave complete integration of SpamFlow’s predic- 
tions with SpamAssassin’s voting as future work. 
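
The crude weighting used in the two experiments amounts to the following. The function names are illustrative, and the `REQUIRED_SCORE` of 5.0 is taken from the `required=5.0` field in the Figure 4 headers. Treating a score at or above that value as spam is an assumption of this sketch.

```python
REQUIRED_SCORE = 5.0  # threshold shown in the Figure 4 headers


def combined_score(sa_score, spamflow_says_spam, spam_bonus):
    """Adjust SpamAssassin's score by SpamFlow's binary vote.

    spam_bonus is 0.2 in experiment 1 and 1.0 in experiment 2;
    a ham prediction always subtracts 1.0.
    """
    return sa_score + (spam_bonus if spamflow_says_spam else -1.0)


def is_spam(sa_score, spamflow_says_spam, spam_bonus):
    """Final disposition after adding SpamFlow's vote."""
    score = combined_score(sa_score, spamflow_says_spam, spam_bonus)
    return score >= REQUIRED_SCORE
```

With a borderline SpamAssassin score of 4.5, a SpamFlow spam vote changes the outcome in experiment 2 but not in experiment 1, which illustrates why the larger weight catches more spam at the cost of more false positives.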


Table 2 shows the confusion data for SpamAssassin 
alone, SpamFlow alone, and the combination. In the first 
combined vote, we achieve better performance with the 
same number of false positives. In the second combined 
vote, we achieve even better performance, but at the cost 
of false positives. In all cases, the combination increases 
the overall F-score. 
Figure 6: Auto-learning classification results for three SpamFlow classifiers on live production traffic as a function of
cumulative exemplar training messages received.


Table 2: Confusion data comparing SpamAssassin performance
with and without SpamFlow auto-learning

SpamAssassin (SA) | 5288 | 0.991
SpamFlow          | 5224 | 0.980
SA+SpamFlow(1)    | 5299 | 0.992
SA+SpamFlow(2)    | 5335 | 0.995


5 Discussion


Can spammers adapt and avoid a transport-based classification
scheme? Because SpamFlow exploits one of the fundamental
weaknesses of spammers, their need to send large volumes of
spam on bandwidth-constrained links, we believe it is
difficult for spammers to evade. A spammer might send spam at
a lower rate or upgrade their infrastructure in order to
remove congestion effects from their flows. However, either
strategy is likely to impose monetary and time costs on the
spammer.

Of note is that our techniques work equally well in IPv6 as
the TCP transport-layer characteristics SpamFlow relies on in
IPv4 are the same in IPv6. The fact that SpamFlow is IP
address agnostic suggests that it may be an even more
important technique in an IPv6 world where the large address
space is difficult to reliably map.

One possible limitation of SpamFlow is that it may be unable
to distinguish between a botnet host sending large volumes of
spam and traffic from a host that is simply busy, or on a
congested subnetwork. However, other transport-layer features
are decoupled from congestion; for instance, a CPU-bound bot
host will perform TCP flow control and advertise a small
receiver window, an effect that SpamFlow uses as part of its
decision process.

Further, SpamFlow detects hosts that send volumes of email
that exceed the local uplink and processing capacity.
Personal, home or small business servers do not have the same
volume requirement as spammers and thus are unlikely to
induce the same TCP congestion effects we observe. In
reality, there is a value judgment that makes SpamFlow
practical and reasonable. Specifically, users who wish to
ensure that their emails are delivered typically invest in
suitable infrastructure, contract with an outside provider or
use their service provider’s email systems. Companies are not
sourcing large amounts of crucial email from hosts attached
by consumer-grade connections. The vast majority of home
users utilize their provider’s email infrastructure or employ
popular web-based services. Thus, SpamFlow only discriminates
against sources that are both poorly connected and injecting
large volumes of mail.

However, in future work, we plan to experiment with the
sensitivity of SpamFlow to false positives originating from
congestion induced by other nodes and other applications. We
believe there will remain adequate discriminatory signal to
discern botnet hosts. Even when SpamFlow does mispredict, our
results show that combining SpamFlow with other classifiers
leads to improved performance and can overcome instances of
false positives by individual classifiers.

6 Conclusions and Future Work 


This research implemented the necessary infrastructure 
to perform real-time, on-line transport-layer classifica- 
tion of email messages. We plan to distribute our system 
as part of the third-party SpamAssassin plugin library in 
order to facilitate widespread deployment, have an impact
on abusive messaging traffic, and refine the system.

We detail the system architecture to integrate network 
transport features with SpamAssassin, an MTA, and a 
classification engine. Our testing reveals that the system 
can handle realistic traffic loads. Of note, we tackle the 
bootstrapping problem of obtaining representative net- 
work traffic on a per-network basis by leveraging auto- 
learning to automatically train on exemplar messages. 

Using our techniques, we achieve accuracy, precision, 
and recall performance greater than 95 percent after
receiving only ≈ 2^10 messages during live, real-world
production testing. We emphasize that these results come
from observing only network traffic features; in actual 
deployment, the SpamFlow plugin will, as with other 
parts of the SpamAssassin system, place a weighted vote. 
Overall performance will likely improve using traditional 
features in addition to network traffic features. 

We note, however, that our live-testing corpus is small. 
Our intent in this work was to demonstrate the practi- 
cal feasibility of using transport network traffic features. 
In future work, we plan to investigate SpamFlow’s per- 
formance and scalability in large, production systems 
against much larger volumes of traffic. Our hope is to 
enable the practical deployment of transport-layer based 
abusive traffic detection and mitigation techniques to sys- 
tem administrators. 

Finally, we observe that the distributed computing 
platform offered by botnets enables a wide variety of 
attacks and scams beyond abusive email. Beyond mes- 
saging abuse, botnets are employed in phishing attacks, 
scam infrastructure hosting, distributed denial-of-service 
(DDoS) attacks, and more. For example, some bot- 
nets effectively provide a Content Distribution Network 
(CDN) for hosting scam infrastructure. Botnet CDNs 
are used to host web sites (e.g. landing sites for ordering 
prescription pharmaceuticals or redirection servers),
distribute malicious code, and serve a variety of other
nefarious purposes. Still other botnets are employed to perform
dictionary attacks against servers, brute force or otherwise
solve CAPTCHAs [30], etc. in order to create accounts
on social network sites and further spread abusive
traffic via multiple distribution channels. 

We believe transport-layer techniques generalize to 
Abstract


Most existing intrusion detection systems take a passive 
approach to observing attacks or noticing exploits. We 
suggest that active intrusion detection (AID) techniques 
provide value, particularly in scenarios where an admin- 
istrator attempts to recover a network infrastructure from 
a compromise. In such cases, an attacker may have cor- 
rupted fundamental services (e.g., ARP, DHCP, DNS, 
NTP), and existing IDS or auditing tools may lack the 
precision or pervasive deployment to observe symptoms 
of this corruption. We prototype a specific instance of the 
active intrusion detection approach: how we can use an 
AID mechanism based on packet injection to help detect 
rogue services. 

Tags: security, active intrusion detection, networking, 
trust relationships, recovery 


1 Introduction 


Existing network intrusion detection systems (e.g., 
Bro [35, 12], Snort [31]) typically take a passive ap- 
proach to detecting attacks: they scan network pack- 
ets and flows to match their content against known- 
malicious byte patterns (i.e., signatures). Such sensors 
are typically situated at the network edge or other traf- 
fic choke point rather than on individual hosts, and they 
rarely interpose on (i.e., inject packets or frames into) the
actual connection or flow. 

IDS systems rarely take an active approach to detect- 
ing malicious behavior or indicators within the network. 
By active, we mean that the sensor purposefully injects 
packets and data meant to perturb the state of the net- 
work, in essence becoming part of the various connec- 
tions occurring on the network. Some existing IDS sen- 
sors may be “active” in the sense that they periodically 
scan some hosts or listen to some specific connections, 
or that they attempt to proactively firewall or quarantine 
hosts suspected of being malicious (for example, Network
Access Control or NAC). To the best of our knowledge,
most existing IDSs do not actively participate in
network conversations to deduce end host behavior. 

This hesitance may be due to the perceived danger of 
actively issuing network traffic designed to remotely di- 
agnose the existence of malware or corrupted service on 
an end host or server (such traffic might have an adverse 
effect on benign hosts or servers). 

In this paper, we suggest that the paradigm of active 
intrusion detection (AID) is relatively under-explored, 
and we offer an example of how such proactive scanning 
for malicious behavior at the network level can benefit 
a system administrator focused on recovering a network 
infrastructure from an attack that attempts to replace or 
spoof critical network services. 


1.1 Motivation: Intrusion Recovery 


Recovering a network infrastructure from an attack — 
particularly an attack that has compromised a large por- 
tion of the infrastructure [19] — is a complex, difficult, 
and time-consuming task. Furthermore, the administra- 
tor may not have much confidence in the services that 
remain running after the discovery of such a compro- 
mise. Because auditing and forensics are expensive pro- 
cesses (in terms of time and density of instrumentation), 
and such activities can be greatly curtailed because of the 
need to get the network back up and running, system ad- 
ministrators may have little information about what parts 
of the system remain trustworthy. 


1.2 The Challenge of Recovering Trust


We see the fundamental difficulty in such a situation as 
the task of recovering trust in the network infrastruc- 
ture. For example, if the DNS server has been compro- 
mised, users cannot trust that their DNS queries have not 
been tampered with. Similar trust relationships exist with 
ARP and DHCP along with other critical network
services. In essence, each protocol implies that the client
trusts the server to relay correct information about the 
network properties. Likewise, the server trusts clients to 
act only on their own behalf. 

Trust exists in many forms within the network. When 
a host accepts an offer of a DHCP address, it implicitly 
trusts the DHCP server it received the offer from. Sim- 
ilarly, when a switch participates in a bridge election, it 
implicitly trusts every other switch that is participating. 
This arrangement exists out of necessity, but can cause 
problems when the wrong entities are trusted. 

Modern attacks have increased in sophistication; many
now involve hijacking benign hosts and their network 
stacks for malicious use (which requires altering the nor- 
mal behavior of the affected devices). Elements of the 
network infrastructure, such as routers and switches, are 
also attractive targets since network hosts frequently trust 
them implicitly. Compromising such machines can give 
an adversary a great deal of power without requiring him 
or her to attack very many machines. Such attacks are a 
useful way for an attacker to retain some level of control 
and spread, and one recent example¹ attempts to run a
rogue DHCP service. 

Without the ability to meaningfully trust the informa- 
tion such services provide, and in the absence of strong 
authentication at such low levels of the network (as is 
typical for very good reasons, see Section 1.4), the task 
of rebuilding the network from scratch can require a Her- 
culean effort. 

In the course of rebuilding trust in critical low-level 
services, having a tool that can actively probe for the 
presence of a malicious or compromised low-level ser- 
vice can help identify remnants of an attacker attempting 
to spoof or man-in-the-middle these services. 


1.3 Focus 


This paper presents an early step toward a more mature 
infrastructure for supporting such network recovery ac- 
tivities. Although we are motivated by this problem, our 
emphasis and focus for the scope of this paper is limited 
to: 


• constructing a data model for representing trust
relationships between network services; and


• implementing a proof-of-concept prototype that uses
packet injection (via Scapy²) to probe suspect services
and examine their responses under the trust
relationship model.


These active probes will not search for specific ex- 
ploits, as the examples we discuss in Section 2.4 do. 
Thus, we are not aiming to create a thorough vulnera- 
bility scanner. Instead, our probes are meant to test for 
proper functionality and thereby trustworthiness. Our
form of active probing is designed less to find out what 
causes a specific problem than to find whether a potential 
misconfiguration or malicious influence exists. 


1.4 Active Intrusion Detection 


Our primary contribution is to propose a new pattern for 
intrusion detection: actively issuing probes (in the form 
of specially crafted or purposefully malformed network 
packets) meant to reveal the presence or operation of 
rogue services. 

Most previous work, even of an active flavor, has dealt 
with detecting specific vulnerabilities or exploits. In con- 
trast, we introduce a method for crafting active scanning 
patterns meant to elicit a certain behavior from network 
hosts. Such a facility can help establish and maintain 
a basis for verifying the trustworthiness of network
services on an ongoing basis (one can think of it as
“Tripwire” for network behaviors). We are not searching for
specific exploits (as the examples in Section 2.4 do), but
rather search for deception patterns (i.e., indications that
rogue services exist or that otherwise trusted services are 
compromised). 

We believe active probing is most useful in verifying 
the trustworthiness of certain key network services, in- 
cluding DNS, DHCP, and ARP. We call these services 
the Deception Surface of the network, because it is ex- 
actly this fabric upon which most users implicitly (and 
often unknowingly) base their belief that they are inter- 
acting with a trustworthy network connection or service. 
Since these protocols rarely involve authentication, they 
are ripe targets for deception. 

There are good reasons for not employing authentica- 
tion and authorization infrastructure at such a low net- 
work level. The effort involved in managing this equip- 
ment and these services in the presence of a variety of 
different authentication mechanisms and credentials is 
greatly increased. Without the need to predistribute cre- 
dentials, hosts are free to “plug-and-play” with the net- 
work; being able to simply trust these services by default 
is a labor-saving practice. For a large enterprise network, 
configuring each host with authentication credentials for 
all deception surface network services requires a large 
investment of valuable time and energy. Most users pre- 
fer their machines to work out of the box, and prefer to 
avoid extensive setup time. For this reason, such authen- 
tication (even if a mechanism exists, like DNSSEC) fre- 
quently remains unused, thereby leaving room for rogue 
services and deception. 


Central Assumption One of the central assumptions 
of our approach is the hypothesis that there is an
equivalence between “normal” behavior and “trustworthy”
behavior. As a consequence, our approach is currently best
suited toward detecting malicious influence or attacks 
that change the normal behavior of a service; in other 
words, we cannot detect attacks that display syntacti- 
cally or semantically indistinguishable behavior (and our 
tool’s model of a service’s behavior may be incomplete, 
and thus unable to test features or characteristics that 
may have been changed). Our tool makes the assump- 
tion that changes in normal behavior are symptoms of 
either malicious influence or misconfiguration. 

Our goal with active probing is to significantly raise
the bar for an attacker: they must now not only provide a
rogue service, but also mimic all the logic and failure
modes of the “valid” service’s code and specific
configuration. In a sense, active probing helps swing the
attacker’s traditionally asymmetric advantage to the network
defender.


2 Related Work 


Our work on active intrusion detection is inspired by re- 
cent examples of (largely manual) analysis of the proper- 
ties and behavior of exploits and malware (see examples 
below). At the same time, the most related work from a 
technical perspective is the body of work on OS finger- 
printing (see below) and detecting network sniffers (e.g., 
sniffer-detect.nse [21]). This latter script takes advantage 
of the fact that a network stack in promiscuous mode will 
pick up packets that are not intended for it, but after the 
stack removes the addresses, all higher layers assume the 
packet is intended for the local stack and act accordingly. 
The sniffer-detect script uses ARP probes in this manner, 
and by the responses it hears is able to make a determina- 
tion about whether or not the probed host is in promiscu- 
ous mode. At its core, the approach exploits an assump- 
tion made within the stack: that packets which reach the 
upper layers of the stack are supposed to be answered by 
that stack. This insight is an excellent independent ap- 
plication of the combination of the Stimulus-Response 
pattern and the Cross-Layer Data pattern (see Section 3). 
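
To make the probe concrete, the sketch below builds the bytes of an ARP request whose Ethernet destination is not the broadcast address. It only constructs the frame and does not send it. The default destination MAC `ff:ff:ff:ff:ff:fe`, the helper names, and the example addresses are illustrative assumptions, not details taken from the sniffer-detect script.

```python
import struct


def mac_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to six raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def ip_bytes(ip):
    """Convert dotted-quad IPv4 text to four raw bytes."""
    return bytes(int(part) for part in ip.split("."))


def arp_probe_frame(src_mac, src_ip, target_ip, dst_mac="ff:ff:ff:ff:ff:fe"):
    """Build an Ethernet frame carrying an ARP request for target_ip.

    dst_mac is deliberately not the broadcast address, so a NIC that
    is not in promiscuous mode should filter the frame out.
    """
    ethernet = mac_bytes(dst_mac) + mac_bytes(src_mac) + struct.pack("!H", 0x0806)
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,        # hardware type: Ethernet
        0x0800,   # protocol type: IPv4
        6, 4,     # hardware and protocol address lengths
        1,        # opcode: request
        mac_bytes(src_mac), ip_bytes(src_ip),
        bytes(6), ip_bytes(target_ip),
    )
    return ethernet + arp
```

If a host answers such a request, its interface accepted a frame it should have discarded, which is the stimulus-response signal the script relies on.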
Port-knocking is a similar idea to active probing ap- 
plied to access control: by probing a host with a par- 
ticular pattern of packets, one can gain the ability to 
have the target firewall forward subsequent packets. In 
the theme of verifying host behavior to detect deviations 
from expected behavior, this work is conceptually related 
to Frias-Martinez et al. [15], who enable network access 
control (especially for MANET environments) based on 
exchanging anomaly detection byte content models. 


2.1 Finding Deceptions vs. Monitoring 


One central question is how much active probing dif- 
fers from existing network “good hygiene” monitoring 


practices like using a second or third independent net- 
work connection to actively monitor properties, services, 
and data that your network exposes to the outside world. 
We note that active probing is an extension of common 
practice to proactively scan internal networks with tools 
like NMap to discover open ports, new machines, or 
other previously unknown activity at the network edge 
or within an organization’s network core. Rather than 
just detecting open ports on machines that should not 
be there, our approach is predicated on reasoning about 
deceptions that exist in the network infrastructure. Although
vulnerability scanning software (e.g., NeXpose, Nessus) does
probe hosts and servers, this type of probing
typically focuses on identifying vulnerable versions
of software services rather than detecting the presence of 
malcode or malicious activity. 


2.2 Intrusion Detection 


Network intrusion detection systems like Snort and 
Bro [35] have a number of advantages: since they are 
passive, they do not impose load on the network and they 
can be difficult to detect. We detail some of the differ- 
ences between active and passive approaches to IDS in 
Table 1. 

Regardless of the response mechanism or other details,
an IDS usually employs a paradigm of passive monitoring
which depends on tracking packet streams and delving
into protocols [5]. This leaves such systems with several
fundamental problems. IDS, whether passive or active,
typically fail-open (i.e., their failure modes do not cease
operation of the monitored system), and they cannot tell
when they miss an alert (i.e., a false negative).

We suggest that the most relevant shortcoming of cur- 
rent network IDS with respect to the concept of active 
probing is that an IDS is left to guess the end state of 
all hosts on the monitored segment. Fundamentally, IDS 
only observes packet flows and cannot feasibly know the 
end-state of every host in the network, making it sus- 
ceptible to evasion attacks [30, 16]. Furthermore, trying 
to keep track of even limited amounts of state poses a 
resource exhaustion problem, and even keeping up with 
certain traffic loads can cause the IDS to miss packets. 


2.3 OS Fingerprinting


Nmap uses a series of up to 16 carefully crafted probe
packets, each of which is crafted for a variation in RFC
specifications [20]. Whereas Nmap issues probes to observe
the characteristics of the target network stack, the
p0f tool uses passive detection, and it examines various
protocol fields (e.g., IP TTL, IP Don’t Fragment, IP Type
of Service, and TCP Window Size) [26]. Alternatively,
LaPorte and Kollmann suggest using DHCP for fingerprinting
[10], and Arkin suggests ICMP [3]. An interesting
variation in this field is Xprobe2; rather than using
a signature-matching approach to OS fingerprinting, it
employs what its authors call a “fuzzy” approach. They
argue that standard signature-matching relies too heavily
on volatile specific signature elements. Xprobe2 instead
uses a matrix-based fingerprint matching method based
on the results of a series of different scans [4].


Active IDS                                          Passive IDS
Creates own context                                 Must learn context from surroundings
Detection based on behavior                         Detection based on signature and anomaly
Constant probing is noisy                           Can run constantly without disturbing network
Cannot run offline                                  Can run offline
Can learn only what is listened for in data model   Can learn anything in a trace

Table 1: A Comparison of Active and Passive IDS Properties. While both approaches face some of the same challenges
(e.g., being fail-open), a hybrid (tightly coupled or otherwise) approach seems promising.

Fingerprinting OS network stacks and other services 
can be an imprecise activity frustrated by the use of 
virtual honeypots [29] or countermeasures like Wang’s 
Morph (Defcon 12). Morph operates on signatures of 
existing production systems, rather than creating decoys. 
Morph scrubs and modifies inbound and outbound traffic 
to mimic a specific target operating system, fooling both 
active and passive fingerprinters [18]. 


2.4 Examples of an Active Pattern 


The Conficker worm, unleashed in January 2009, rep- 
resents one noteworthy example of malware analysis 
that resulted in a way to diagnose the presence of Con- 
ficker’s control channel. The malware itself exploited 
flaws in Microsoft Windows to turn infected machines 
into a large-scale botnet [22]. It proved especially diffi- 
cult to eradicate. Because some peer-to-peer strains of 
the worm used a customized command protocol, subse- 
quent analysis and reverse-engineering provided a means 
of scanning for and identifying infected machines [6].
This example helps illustrate the utility of the general 
pattern of active probing for suspect behaviors. 

The Zombie Web Server Botnet provides another ex- 
ample of active exploit detection. First documented in 
September 2009, the exploit targeted machines running 
web servers, and once installed set up an alternate web 
server on port 8080, thereby avoiding some passive IDS 
monitors that only watch port 80. Hidden frames on affected
websites contained links pointing to free third-party
domain names, which then translated into port
8080 on infected machines. These infected web servers,
which also serviced legitimate sites, then attempted to 
upload malware and other malicious content from this 
rogue 8080 port [1]. If the user’s web browser did not 
accept the uploaded malware, the exploit used an HTTP 
302 Found status to redirect the user to another infected 
web server. From there, the exploit re-attempted the mal- 
ware upload. This redirection was detectable by sending 
HTTP GET messages to the queried server and watching 
for 302 redirects [7]. 


As a final example of the utility of active probing, consider
the Energizer DUO USB Battery Charger exploit
(March 2010). The Energizer DUO Windows application
allowed users to view the status of charging batteries
and installed two .dll files, UsbCharger.dll in the
application directory and Arucer.dll in the system32
directory. The software itself uses UsbCharger.dll to
interact with the computer’s USB interface, but it also
executes Arucer.dll and configures it to start automatically.

Arucer.dll acts as a Trojan horse, opening an unauthorized
backdoor on TCP port 7777 to allow remote
users to view directories, send and receive files, and
execute programs [11]. Since this rogue service responds
only to outside control, passive detection may not be
effective. An active probe, however, can detect the
unauthorized open port even if not in use, and thus identify
the infection more reliably [8].


2.5 Intrusion Recovery 


Recovering a compromised host or network is a difficult
task. Classic [34, 33, 9] and more recent [32, 17] accounts
can both be found, but little work on systematic
approaches to recovery from large scale intrusions
exists [14, 25].


3 Approach


When a compromised machine exists on a network, there 
are two primary ways to find it. First, one can attempt 
to detect the malicious activity passively. Conventional 
intrusion detection systems provide a good example of 
this approach. However, compromised or rogue services 
may not display any behavior that is obviously malicious 
(thus evading misuse-based sensors) nor display behav- 
ior that is particularly new or different than previous 
packets (thus evading anomaly-based sensors). 

Our approach employs active probing. The assump- 
tion underlying the utility of active probing is that such 
probing can reveal discrepancies in internal behavior or 
configuration — particularly at corner cases and for mal- 
formed input. In this sense, active probing helps a net- 
work defender understand how an infection alters its 
host’s behavior or how rogue services operate. 

Active probing is a suitable tool for discovery of la- 
tent or otherwise stealthy malicious influence; we can 
probe hosts (or the network at large using broadcast ad- 
dresses) rather than waiting for them to send packets. 
Active probing can constructively infer network state and 
context by issuing targeted probes. 

Active probing exploits several unique features of
a networked environment; in essence, this environment
represents a distributed state and a set of computations
(i.e., the network stacks) involved in manipulating the
global state of the network. The arrangement of these
relationships and the nature of most protocol interactions
provide several key areas of focus for designing probe
patterns (e.g., sequences of protocol messages intended
to elicit distinguishing responses).


3.1 Key Insight: Behavior Differences Due 
to Implementation or Configuration 


During our experimentation, we frequently observed that 
the same stimulus produced different responses from dif- 
ferent network entities. We discovered two reasons for 
this. The first reason relates to configuration. In some 
cases, responses differ because the two entities operated 
based on different configurations. For example, con- 
sider two identical DHCP server implementations pro- 
grammed with different gateways. All other network 
conditions being equivalent, these two servers will al- 
ways give a different result when queried, since they are 
programmed to do so. The richness of the configura- 
tion space can help distinguish between a rogue server 
set up for minimal interposition on a service and the full- 
featured service. 

The second reason relates to implementation. In most 
cases, one or more RFCs lay out the behavior a network 
service or protocol should exhibit. In practice, however, 


we find that differences exist, whether due to lack of 
specification for every possible case, or simple deviance 
from the specification. Generally, we found that imple- 
mentations perform similarly on common cases, such as 
well-behaved DHCP Discover packets. This observation 
makes intuitive sense, since specifications exist for them. 
It is the less well-behaved stimuli that are handled differ- 
ently. Corner cases and malformed input (e.g., semanti- 
cally invalid options pairings or flag settings) cause dif- 
ferent, infrequently exercised code paths to execute — it 
is unlikely that an attacker has replicated such behavior 
with high fidelity. 

Taken together, understanding these differences forms
the foundation of our method. If we look for both types 
of differences, then two entities must exhibit the exact 
same behaviors in order to escape notice. Put another 
way, if someone wants to masquerade as another on the 
network, the imitator must mimic not just the target’s 
normal behaviors in common cases (relatively easy) but 
the minor, idiosyncratic ones as well (we claim that this 
is harder). 
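To make this argument concrete, the following sketch compares a
profile recorded from a trusted server against one observed later.
It is an illustration under stated assumptions rather than our
prototype: the probe names, the canned responses, and the function
differing_probes are all hypothetical.

```python
# Illustrative sketch only: probe names and responses are hypothetical,
# not data recorded by the prototype.

def differing_probes(baseline, observed):
    """Return the sorted names of probes whose responses differ.

    A probe that is missing from one profile (e.g., the server stayed
    silent) counts as a difference.
    """
    names = set(baseline) | set(observed)
    return sorted(n for n in names if baseline.get(n) != observed.get(n))


# Profile of the known-good server, taken while it is trusted.
baseline = {
    "discover_normal": ("offer", "gateway=10.0.0.1"),
    "discover_bad_flags": ("silence", None),   # corner case: ignored
    "discover_zero_chaddr": ("offer", "gateway=10.0.0.1"),
}

# A rogue server that mimics the common case but not the corner cases.
observed = {
    "discover_normal": ("offer", "gateway=10.0.0.1"),
    "discover_bad_flags": ("offer", "gateway=10.0.0.1"),
    # no entry recorded for the zero-chaddr probe
}

suspicious = differing_probes(baseline, observed)
```

Here the imitator matches the well-behaved probe but is exposed by the
two corner cases, which is the situation we expect when an adversary
has replicated only the common code paths.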


3.2 Stimulus-Response Pattern 


We note that many network interactions take the form 
of pairing between stimulus and response. The DHCP 
Discover/Response cycle, the DNS Query/Response cy- 
cle, and many others all fall into this category, whereas 
something like the Cisco Discovery Protocol does not. 
Note that the stimulus-response pattern includes not only
client-server interactions but also peer-to-peer ones. We rely
on and harness this stimulus-response paradigm for our
verification method.


3.3 Network Trust Relationships and
Trusted Data


Trust relationships form the basic building block of the
network. In the majority of cases, hosts trust essential
services by default, to ensure ease of connection without
the burden of extensive configuration. As an example,
without prior configuration in an IPv4 environment,
DHCP and ARP provide the primary ways for a host
to learn about the network. Unfortunately, the scope of
many modern networks makes these trust-by-default
relationships all but necessary, since manually configuring
and re-configuring every host in the network is often
impractical. As a consequence, they present an avenue for
an adversary who can masquerade as a provider of one
of these legitimate trusted services. If the adversary
offers the same trusted-by-default service and can get his
or her information believed, then he or she has compromised
whatever elements of the network believe that
information. We target this sort of “trusted-by-default”
deception.

Note that we do not specify anything about the exact 
process by which the deception we have just described is 
executed. It could be that the adversary has disabled the 
legitimate service, or is simply able to get its information 
out faster than the legitimate information does. Regard- 
less of the specifics, we begin in the place of a network 
entity and mistrust the service provider that we hear but 
whose trustworthiness we must accept for normal oper- 
ation. Everything we do constitutes an attempt to verify 
the trustworthiness of that service provider. Is the infor- 
mation they provide consistent? Do they respond in the 
way a legitimate service might if we make illogical or 
semantically invalid requests? Or, if they are an attacker
intent on remaining stealthy, do they greedily respond
to packets that look attractive to intercept and interpret,
but are really meaningless (in terms of us getting on the
network) and are only meant to flush out such malicious
interposition?


3.4 Cross-Layer Data 


Sometimes, it helps to exploit the layered nature of network
protocols. Consider a man-in-the-middle (MITM) attack,
one of the most basic and most common compromises.
An ordinary machine will pick up all packets and exam- 
ine them, discarding any that are not addressed to it. This 
behavior is expected from the majority of well-behaved 
machines on a network. However, a machine acting as a 
MITM will pick up these packets, examine them, per- 
form some sort of malicious activity (be it recording, 
modifying, fuzzing, or any number of other things), and 
then send them on to their destination. To do this, the at- 
tacker must modify the machine’s normal network stack, 
and configure the kernel to forward packets. This modi- 
fication makes the compromise detectable (see Section 6 
for our experiment on this topic). 


4 Active Probing Model 


We model active probing on the concept of a network 
conversation containing messages that reveal the viola- 
tion of conditions related to configuration or behavior, 
where these constraints represent the belief of the prob- 
ing entity about the valid, trustworthy state of the net- 
work. 

In essence, active probes attempt to verify some behavior
of the target host or service, and the messages
emitted from the target host in response to our (crafted)
protocol messages represent characteristics of that behavior.
Figure 1 depicts this interplay in a very basic
form; the intent behind probing is to discover behavioral
artifacts arising from differences in implementation or
configuration (as discussed in Section 3.1).


[Ladder diagram between P and T; legible steps: T prepares m'_i, T sends m'_i, P adds r(m'_i) to R.]

Figure 1: Ladder Diagram for Active Probing Data
Model. A probing host P (our prototype plays this role)
issues probes designed to exercise logic and configuration
corner cases in the target host T. As T reacts to
these probes (and generates the replies m'_i, subject to its
implementation quirks and configuration details), P builds a set of
data relevant to the trust relationship being probed.


Our model consists of two parties P and T. P is the
prober and has the ability to simulate multiple protocol
stack implementations (especially “broken” ones). The
second, T, is the target or service provider. P’s hypothesis
is that T may contain a broken, partial, incomplete,
or incorrectly configured protocol stack. If T were a
trustworthy service, it would display “normal” expected
behavior according to the trust relationship between P
(rather, the role of the client or peer that P is playing)
and T (more specifically, the server or peer that T may
be masquerading as). In this sense of having an established
trust relationship, we say that T provides a service
X to P.


For P to consider T trustworthy with respect to service
X, T must satisfy a set of conditions C on its behavior.
To verify that these constraints hold, P uses a sequence
of messages M = m_1, m_2, ..., m_n sent to T that take the
form of packet probes.

For each m_i, there exists a corresponding message m'_i
from T to P that may be a packet, a sequence of packets,
or the absence of a packet (determined through a
preconfigured timeout). For each such m'_i, there is some
relevant portion r(m'_i) that serves as evidence for or against
some particular element c_j ∈ C. As each m'_i is received
(or not received), P performs the operation R = R ∪ r(m'_i),
building a body of evidence R as shown in Figure 1.
Once all probes have been sent and answers recorded,
the probing entity decides whether or not R violates the
conditions contained in C.


5 Methodology


In attempting to recover trust in key network services, 
a system administrator would follow the general tasks 
outlined below. The procedures we describe are typ- 
ically aimed at an auditing-style service rather than a 
general-purpose scanner for running malware or botnet 
command and control. As such, the definition of a trust 
relationship, the specific verification plan, and the format 
and content of probes are usually service specific and
informed by administrator knowledge of their own service
implementations and configuration. We posit (but have 
not shown) that the amount of effort needed to follow the 
steps below is similar to tuning of IDS rules or calibra- 
tion of IDS parameters to a specific environment. 


5.1 Define Trustworthiness 


In order to establish the trustworthiness of a network 
service, there must be a notion of what trustworthiness 
means for that service. This will vary based on the net- 
work and service being verified, and in most cases will 
depend on the specific deployment of the service being 
probed. These trustworthiness criteria directly inform the
set of constraints C.

For example, trustworthiness in the forwarding case 
means that no hosts but known gateways should exhibit 
forwarding behavior. For a more complicated system, a 
definition might take into account information the legiti- 
mate service should provide (for example, a known-valid 
set of DNS responses), and ways it should respond to 
certain stimuli (e.g., how the service handles a particular 
corner case configuration or incompatible flags). Gener- 
ally speaking, the definition is what we need to hear to 
trust the speaker. 


5.2 Verification Plan 


Once we have an idea of what trustworthiness looks like, 
we need to develop a plan of how to verify it. Recall 
the two types of differences between service providers 
we discussed earlier. Many network entities have peculiarities
in their implementations, and the plan for verification
should make use of them. It is also necessary to
get as much standard information as possible from the
probed entity, so that both types of differences can be
detected. The more information gathered, both about the
service implementation and the service configuration, the
harder an adversary must work to fool our probe. In doing this we
need to plan to check our service against every part of the
trustworthiness definition we have already developed.
For the examples above, the verification plan might 
range from a simple comparison of known good answers 
to specific DNS queries to the absence of a “forwarded” 


packet. In essence, each verification plan is tightly cou- 
pled to the actual method of detecting a specific decep- 
tion on the network. As yet, we do not contend with 
automating this process. 


5.3 Probe Creation


The next step in our methodology calls for turning the 
plan into a set of active message probes and codify- 
ing those probes. Although a variety of packet crafting 
mechanisms exist, we found Scapy, a packet generation 
and manipulation tool, to be helpful. The codified probes 
crafted in Scapy’s environment comprise the functional 
portion of an active verification tool. 


5.4 Reply Detection 


Finally, we need to capture the replies to our probes and 
examine them against the constraints derived from our 
trustworthiness definition. With that information, we
must make a determination as to the trustworthiness of
the service that produced those responses.
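The reply-detection step can be sketched in terms of the model from
Section 4: each probe m_i yields a reply (or a timeout), the relevant
portion is added to the evidence set R, and R is then checked against
the conditions C. The sketch below is a simplified illustration, not
our prototype; collect_evidence, violated_conditions, the toy target,
and the two conditions are hypothetical names chosen for the example.

```python
def collect_evidence(probes, send_probe, relevant):
    """Send each probe m_i and accumulate R = R ∪ r(m'_i).

    send_probe(m) returns the reply m' (or None on timeout);
    relevant(m, reply) extracts the evidence r(m') as a hashable value.
    """
    evidence = set()
    for probe in probes:
        reply = send_probe(probe)
        evidence.add(relevant(probe, reply))
    return evidence


def violated_conditions(evidence, conditions):
    """Return the names of the conditions in C that the evidence R violates."""
    return sorted(name for name, holds in conditions.items()
                  if not holds(evidence))


# Toy target: answers the well-formed probe and (wrongly) the malformed one.
def fake_target(probe):
    return {"well_formed": "offer", "malformed": "offer"}.get(probe)


probes = ["well_formed", "malformed", "unknown_option"]
R = collect_evidence(probes, fake_target, lambda m, reply: (m, reply))

# Conditions C: what a trustworthy service is expected to do.
C = {
    "answers_well_formed": lambda ev: ("well_formed", "offer") in ev,
    "ignores_malformed": lambda ev: ("malformed", None) in ev,
}
violations = violated_conditions(R, C)
```

Keeping the conditions as named predicates mirrors the service-specific
nature of the trustworthiness definition: the collection loop stays the
same while C changes per service.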


5.5 Implementation 


We have found Scapy [27], a freeware packet manipula- 
tion program, quite useful. Scapy allows users to build, 
sniff, analyze, decode, send, and receive packets with in- 
credible flexibility. It does not interpret response packets 
directly, so it can prove more useful than other packet
injection or scanning tools in some scenarios. It employs
Python-based control, so its commands are also easily 
adapted into Python programs. We have used Scapy to 
implement our prototype probing tool. Currently, ver- 
ification plans (and their corresponding probes) require 
individually-developed Scapy scripts. 


6 Case Studies: Detecting Deceptions 


Our preliminary evaluation focuses on illustrating our 
prototype’s effectiveness at detecting network deceptions 
rather than attempting to detect malicious software (e.g., 
botnet command-and-control, spyware). To a certain ex- 
tent, the related work we discuss in Section 2 illustrates 
how one might go about using existing tools like nmap to 
identify command-and-control or backdoors. Although 
we illustrate how to detect (1) a duplicate DHCP server 
and (2) the presence of a host configured for forward- 
ing, our point is that these two examples are patterns of 
network deceptions, and this is the main intent of our
approach.


6.1 Detecting Forwarding Behavior


As an example of using the Cross-Layer Data pattern,
one of our first experiments dealt with detection of
forwarding behavior. An ordinary machine will typically
silently discard packets (frames) not addressed to it. This
behavior is expected from the majority of well-behaved
machines on a network. A machine acting as a MITM,
however, will pick up these packets, examine them,
perform some sort of malicious activity (be it recording,
modifying, fuzzing, or any number of other things),
and then send them on to their destination. To accomplish
this MITM, the attacker must modify the machine’s
normal network stack settings and configure the kernel
to forward packets. This modification is remotely
detectable.

We hypothesized that if we sent a broadcast packet out 
to the network with the destination as our own machine, a 
host configured for forwarding might give itself away by 
sending the packet back to us. We used Scapy to test this, 
sending IP packets carrying a layer 2 broadcast address 
and a layer 3 address of our own machine. We found that 
many forwarding entities (for example, Linksys routers) 
did identify themselves by forwarding the packet as ex- 
pected, but Linux kernels in forwarding mode do not. 
We hypothesized that this was due to the layer 2 broad- 
cast address of the packet. To test this hypothesis, we 
replaced the broadcast hardware address with a unicast 
address of the machine we wanted to probe, and listened 
for the response. We found that this resulted in the packet 
being sent back to us, as expected. We codified this re- 
sult into an Nmap plugin that detects hosts in forwarding 
mode that are behaving in what is generally an undesir- 
able manner and thus may have been compromised or 
misconfigured. 

Formally speaking, in this experiment of detecting a
host in forwarding mode, the condition C is that only a
small known set of hosts on the network should be in
forwarding mode. Specifically, only those hosts should
deliver the packet we generated back to us: we chose the
packet contents so that they are consumed by hosts that
are promiscuous and forwarding, and whose higher
network-stack layers do not realize that they should not
send this packet back to its origin. If the responses that P
gathers contain an IP address outside this set (i.e., we see
our message m_i in R), we know that the trust condition is
violated.
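The probe and the condition can be sketched as follows. Our prototype
builds its packets with Scapy; this illustration instead uses only the
Python standard library so that the frame layout is explicit. The
function names, the payload marker, the use of IP protocol number 253
(reserved for experimentation), and the choice of source address are
all assumptions made for the example, and actually transmitting the
frame (which requires a raw socket and privileges) is omitted.

```python
import socket
import struct


def ip_checksum(header):
    """Internet checksum (ones'-complement sum) over an even-length header."""
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    total = (total & 0xFFFF) + (total >> 16)
    total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def mac_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into six raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def build_probe_frame(target_mac, own_mac, src_ip, own_ip, payload=b"fwd-probe"):
    """Frame addressed to the target at layer 2 but to ourselves at layer 3."""
    header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + len(payload),   # version/IHL, TOS, total length
        0, 0,                          # identification, flags/fragment offset
        64, 253, 0,                    # TTL, protocol (assumed), checksum placeholder
        socket.inet_aton(src_ip), socket.inet_aton(own_ip),
    )
    checksum = struct.pack("!H", ip_checksum(header))
    header = header[:10] + checksum + header[12:]
    ethernet = mac_bytes(target_mac) + mac_bytes(own_mac) + struct.pack("!H", 0x0800)
    return ethernet + header + payload


def unexpected_forwarders(observed_senders, known_gateways):
    """Hosts that returned our probe but are not approved forwarders."""
    return sorted(set(observed_senders) - set(known_gateways))


frame = build_probe_frame("00:11:22:33:44:55", "66:77:88:99:aa:bb",
                          "192.0.2.1", "192.0.2.2")
```

A non-empty result from unexpected_forwarders corresponds to a
violation of the condition C described above.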


6.2 Rogue DHCP Server 


To demonstrate the viability of an active probing approach,
we have implemented it on the Dynamic Host
Configuration Protocol. DHCP makes an excellent subject
for a case study for several reasons. First, it provides
new hosts several critical pieces of knowledge about the
network, such as an IP address, gateway information, and
location of the DNS servers. Typically, a network stack
sends out a DHCP Discover immediately after coming
online, highlighting DHCP’s importance. If an adversary
can get malicious DHCP information believed, he or she
can exert a great deal of control over the deceived hosts.
Second, it comprises part of our Deception surface, so
most hosts trust whatever DHCP traffic they receive by
default.


Figure 2: Basic DHCP Setup With Cisco Switch. This
self-contained test environment consists of a set of computers
connected to a single Cisco switch (a DHCP
server, the rogue DHCP service, and our prober P). This
setup was also used for the forwarding detection scenario.

We see a recent example of an exploit using DHCP 
in a variant of the Alureon rootkit. This exploit infects 
networks and sets up a rogue DHCP server to compete 
with the legitimate one. This rogue server gives out the 
address of a DNS server under the control of the worm’s 
authors, which in turn points users to a malicious web 
server. This web server attempts to force users to update
their browser, but the “update” they download is in fact
malware that resets their DNS pointer to Google’s
service once the machine is infected [2]. This is the sort
of exploit that motivates us to examine DHCP closely.

Our prototype software uses Scapy scripts to probe 
DHCP servers and can both produce PCAP fingerprint 
files and compare to an existing PCAP fingerprint. In 
practice, we found it successfully distinguished between 
the different servers we used. 


6.3 Environment 


We used three main environments for our DHCP experi- 
ments. First, as shown in Figure 2, we have a small, self- 
contained test environment consisting of a set of comput- 
ers connected to a single Cisco switch. This was also the 
environment we used for the aforementioned forward- 
ing detection work. We configured one of the computers 
with an instance of the udhcpd DHCP server, and ran our 
tests from the other.

We also have access to the production network at Dartmouth
College’s CS department. We used the same machine
for our tests here as in the previous environment,
and ran our experiments against the actual production
server. Finally, we have the DHCP server included in a
Linksys WRT54G2 router. It runs as the core of a small
home network.

As described previously, we tried to determine both 
the configuration of the probed server and the server it- 
self. We are not interested in identifying the specific 
server implementation, but rather detecting its differ- 
ences from another server. We do so by taking a brute- 
force approach, where we try several different values for 
different fields. Some of these are well-behaved values, 
and others are designed to test the server’s handling of 
unusual traffic. This allows us to test both the server’s 
configuration and its implementation. 

To test the usefulness of our software, we compared 
responses to probes across our test environments. Do- 
ing so simulates the introduction of another DHCP agent 
onto the network whose traffic we are seeing instead 
of the legitimate server’s traffic. This process is akin 
to comparing a previous behavior model captured in 
a known trustworthy state with a later behavior model 
gleaned from an environment during recovery. We can 
probe the server at a time when we assume it to be in a 
trustworthy state; this can be established by (1) manual 
inspection of the program or process, (2) some kind of 
integrity check of the code and configuration files a la 
Tripwire, or (3) immediately after a new deployment of 
the service. 


6.4 Constructing DHCP Probes 


Within a DHCP packet (see Figure 4), four fields (ciaddr, 
yiaddr, siaddr, and giaddr) contain IP addresses, while a 
fifth (chaddr) contains a MAC address. We check the 
servers’ handling of these fields by setting each one in 
turn to four different types of values: 


• The client’s currently assigned IP address
• Another valid IP address in the client’s subnet
• A valid IP address in another subnet
• An invalid IP address

We also do something similar for the chaddr field:

• The client’s MAC address
• Another valid MAC address
• An all-zeroes MAC address
• An all-ones (broadcast) MAC address
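The value classes listed above can be generated mechanically. The
sketch below shows one way to do so with Python's ipaddress module;
it is an illustration, not the code of our prototype. The function
names are hypothetical, and the concrete choice of 255.255.255.255 as
the "invalid" address is an assumption made for the example.

```python
import ipaddress


def candidate_addresses(client_ip, prefix_len):
    """Return one test value of each kind for an IP-address field."""
    client = ipaddress.IPv4Address(client_ip)
    subnet = ipaddress.IPv4Network(f"{client_ip}/{prefix_len}", strict=False)
    # First usable host in the client's subnet that is not the client.
    neighbor = next(h for h in subnet.hosts() if h != client)
    # First host address just past the client's subnet.
    outsider = subnet.broadcast_address + 2
    return {
        "client": str(client),
        "same_subnet": str(neighbor),
        "other_subnet": str(outsider),
        "invalid": "255.255.255.255",   # assumption: limited broadcast
    }


def candidate_macs(client_mac, other_mac):
    """Return one test value of each kind for the chaddr field."""
    return {
        "client": client_mac,
        "other": other_mac,
        "zeroes": "00:00:00:00:00:00",
        "ones": "ff:ff:ff:ff:ff:ff",
    }


values = candidate_addresses("192.168.1.10", 24)
```

Each returned value would then be placed in one field at a time while
the remaining fields keep their well-behaved values.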


# Request probe w/ ciaddr set to other IP
state = random.getstate()
probeFunc(Ether(src=get_if_raw_hwaddr(conf.iface)[1],
                dst="ff:ff:ff:ff:ff:ff")
          /IP(src="0.0.0.0", dst="255.255.255.255")
          /UDP(sport=68, dport=67)
          /BOOTP(flags=0x8000,
                 chaddr=get_if_raw_hwaddr(conf.iface)[1],
                 giaddr=ip,
                 xid=random.randint(0, 4294967295))
          /DHCP(options=[("message-type", "discover"),
                         ("end")]
          ), state)

# Request probe w/ chaddr zeroes
state = random.getstate()
probeFunc(Ether(src=get_if_raw_hwaddr(conf.iface)[1],
                dst="ff:ff:ff:ff:ff:ff")
          /IP(src="0.0.0.0", dst="255.255.255.255")
          /UDP(sport=68, dport=67)
          /BOOTP(flags=0x8000,
                 chaddr="00:00:00:00:00:00",
                 xid=random.randint(0, 4294967295))
          /DHCP(options=[("message-type", "discover"),
                         ("end")]
          ), state)

# Request probe with chaddr nonsense
state = random.getstate()
probeFunc(Ether(src=get_if_raw_hwaddr(conf.iface)[1],
                dst="ff:ff:ff:ff:ff:ff")
          /IP(src="0.0.0.0", dst="255.255.255.255")
          /UDP(sport=68, dport=67)
          /BOOTP(flags=0x8000,
                 chaddr="gg:gg:gg:gg:gg",
                 xid=random.randint(0, 4294967295))
          /DHCP(options=[("message-type", "discover"),
                         ("end")]
          ), state)


Figure 3: Example Probes for DHCP. Three of the eleven 
probes we constructed for profiling the behavior of a 
DHCP server. We took a profile of the known good 
DHCP service and compared it against another profile 
from a different machine. 


We also send discover probes that manipulate option 
values. These include normal options and the parameter 
request list, which allows the requesting client to ask for 
specific information from the server. We set a number of 
options in our probes and assign them values (where ap- 
plicable) as described above. We also send a number of 
probes requesting different information from the server 
using the parameter request list option (we do not believe 
that Scapy implements all of the options). 
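To show what varying the parameter request list means on the wire,
the sketch below encodes the options field of a Discover message by
hand, using the option codes defined in RFC 2132 (53 for the message
type, 55 for the parameter request list, 255 for the end marker). It
is an illustration using only the standard library, not the Scapy
code of our prototype, and the helper names are hypothetical.

```python
# RFC 2131: the options field begins with this four-octet magic cookie.
MAGIC_COOKIE = bytes([99, 130, 83, 99])

DHCP_MESSAGE_TYPE = 53        # RFC 2132 option codes
PARAMETER_REQUEST_LIST = 55
END = 255
DHCPDISCOVER = 1


def encode_option(code, value):
    """Encode one option as code, length, value."""
    return bytes([code, len(value)]) + value


def discover_options(requested_codes):
    """Options field of a Discover asking for the given option codes."""
    return (MAGIC_COOKIE
            + encode_option(DHCP_MESSAGE_TYPE, bytes([DHCPDISCOVER]))
            + encode_option(PARAMETER_REQUEST_LIST, bytes(requested_codes))
            + bytes([END]))


# Ask for subnet mask (1), router (3), and DNS servers (6).
common = discover_options([1, 3, 6])
# A less well-behaved variant: an empty request list.
odd = discover_options([])
```

Different request lists should change which options a server returns,
while degenerate lists such as the empty one exercise less common
parsing paths.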


6.5 Results 


The tool successfully identified that significant differences
exist between the production DHCP server and the
Linksys router. Not only were the configurations different,
but it turned out that the Linksys DHCP agent
in our third environment ignored several of the less
well-behaved probes, leading to an easy identification.
While not comprehensive, we believe this successful result
demonstrates the value of our approach to active
intrusion detection.


 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     op (1)    |   htype (1)   |   hlen (1)    |   hops (1)    |
+---------------+---------------+---------------+---------------+
|                            xid (4)                            |
+-------------------------------+-------------------------------+
|           secs (2)            |           flags (2)           |
+-------------------------------+-------------------------------+
|                          ciaddr  (4)                          |
+---------------------------------------------------------------+
|                          yiaddr  (4)                          |
+---------------------------------------------------------------+
|                          siaddr  (4)                          |
+---------------------------------------------------------------+
|                          giaddr  (4)                          |
+---------------------------------------------------------------+
|                                                               |
|                          chaddr  (16)                         |
|                                                               |
|                                                               |
+---------------------------------------------------------------+
|                                                               |
|                          sname   (64)                         |
+---------------------------------------------------------------+
|                                                               |
|                          file    (128)                        |
+---------------------------------------------------------------+
|                                                               |
|                          options (variable)                   |
+---------------------------------------------------------------+


Figure 4: DHCP Message Format. This diagram was
copied verbatim from RFC2131 [13].
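
To make the Figure 4 layout concrete, the following minimal sketch packs a DHCP discover message by hand with Python's standard struct module rather than with Scapy. The helper name build_discover and its default arguments are illustrative assumptions and are not part of our prototype; the field sizes, the magic cookie, and option codes 53 (message type), 55 (parameter request list), and 255 (end) follow RFC 2131 and RFC 2132.

```python
import struct

MAGIC_COOKIE = b"\x63\x82\x53\x63"  # marks the start of the options field

def build_discover(xid, chaddr, giaddr=b"\x00" * 4, requested=(1, 3, 6)):
    """Pack a DHCP discover; chaddr is raw bytes, requested lists option codes."""
    header = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1,            # op: BOOTREQUEST
        1,            # htype: Ethernet
        6,            # hlen: length of an Ethernet address
        0,            # hops
        xid,          # transaction id
        0,            # secs
        0x8000,       # flags: broadcast bit set, as in the probes of Figure 3
        b"\x00" * 4,  # ciaddr
        b"\x00" * 4,  # yiaddr
        b"\x00" * 4,  # siaddr
        giaddr,       # giaddr
        chaddr,       # chaddr (struct zero-pads it to 16 bytes)
        b"",          # sname (zero-padded to 64 bytes)
        b"",          # file (zero-padded to 128 bytes)
    )
    options = bytes([53, 1, 1])                         # message type: discover
    options += bytes([55, len(requested), *requested])  # parameter request list
    options += bytes([255])                             # end
    return header + MAGIC_COOKIE + options

packet = build_discover(0x12345678, bytes.fromhex("001122334455"))
```

Varying chaddr, giaddr, or the requested option codes in such a helper yields the kind of less well-behaved probes discussed in this section.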


7 Discussion & Future Work 


We discuss how our active probing methodology would 
apply to two other critical network services (DNS and 
ARP). This is in essence future work, but we offer the 
sketches as evidence of the feasibility of extending this 
type of probing to other fundamental network services. 
We are currently extending our analysis (and crafting 
probes) to other services like SNMP, STP, NTP, and rout- 
ing protocols. Each of these protocols requires a different 
type of approach to composing a verification plan, since 
their modes of operation may not naturally fit a query- 
response pattern. In such scenarios, we can take advan- 
tage of the cross-layer data pattern and trust relationship 
patterns. 


In the two examples below, our probing has different 
semantics than our “rogue DHCP” and “forwarding de- 
tection” examples. Since we sketch an outline of a verifi- 
cation plan, we focused on relatively easy ways to verify 
the trustworthiness of these services (e.g., for DNS com- 
paring against known good answers). We could, how- 
ever, rely on a behavioral signature much more in line 
with the DHCP experiment by issuing probes that exer- 
cise little-used options or ask for incomplete or illogical 
DNS and ARP resolutions.


7.1 Domain Name System


We describe one way in which our method might apply
to the Domain Name System (DNS). DNS operates on the
stimulus-response client-server model, where the client
sends name resolution requests to the server, which in
turn queries as many other servers in the DNS hierarchy
as necessary to get an answer [23, 24]. In many
cases, DNS responses are trusted by default, since they
represent the best and frequently only information a host
has about how to resolve external names to machine
addresses. As such, attacks on DNS are fairly common,
since a successful attack could trick a host into sending
all of its traffic to the adversary.

Clearly, the trustworthiness of DNS depends on giving
correct answers to queries. Our system should be able
to determine whether or not the responses it hears are
correct. If they are not, we can assume that the service we
are querying is untrustworthy. Note that this does not
necessarily mean that the server we query directly has an
issue; since DNS servers form a hierarchy, a trust issue
with one could mean trouble for many others.

In creating a verification plan, we cannot feasibly ex- 
amine what answers the server gives for every possible 
query. We can, however, pick a number of common 
queries and build a list of responses we should receive 
for each one. We need a list rather than a single re- 
sponse, as one name frequently has several hosts which 
respond to queries for it. This list should be large and 
diverse, and the answers built from manual research or 
compiled from DNS queries to different servers, mini- 
mizing the possibility that a compromised server con- 
tributes to our definition of trustworthiness. We also want 
to feed the server some malformed requests, both with 
poorly-formed packets and for names known not to exist 
(this will have to be checked) to test the implementation 
details of the server. 

Our probes would take the form of DNS question 
packets as described previously, which could be done 
with Scapy. Query responses could be listened for, and 
responses checked against the list discussed previously. 
If we hear any unexpected responses, an alert could be 
raised indicating that a possible issue with the server ex- 
ists. 
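
The comparison step of this verification plan can be sketched as follows. This is a minimal illustration under stated assumptions: the function name check_dns_answers, the dictionary layout, and the example names and addresses (taken from documentation ranges) are hypothetical, and sending the actual queries (for example with Scapy) is left out.

```python
def check_dns_answers(observed, known_good):
    """Return alert strings for answers that fall outside the known-good lists.

    observed:   dict mapping a queried name to the set of addresses we heard
    known_good: dict mapping a name to the set of acceptable addresses;
                an empty set means the name is known not to exist
    """
    alerts = []
    for name, answers in observed.items():
        expected = known_good.get(name)
        if expected is None:
            continue  # name is not part of the verification plan
        unexpected = set(answers) - expected
        if unexpected:
            alerts.append(f"{name}: unexpected answer(s) {sorted(unexpected)}")
    return alerts

known_good = {
    "www.example.com": {"192.0.2.10", "192.0.2.11"},
    "no-such-host.example.com": set(),
}
observed = {
    "www.example.com": {"192.0.2.11"},
    "no-such-host.example.com": {"203.0.113.66"},
}
alerts = check_dns_answers(observed, known_good)
```

Keeping a set of acceptable answers per name, rather than a single answer, reflects the observation that one name frequently has several hosts responding for it.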


7.2 Address Resolution Protocol 


We describe how our method could apply to the Address
Resolution Protocol (ARP). ARP operates on the
stimulus-response model where each host or gateway can
both make and service requests [28]. ARP provides important
information enabling communication both within
and across networks, and its information is generally
trusted by default, so it provides a good illustration for
active probing.


Protocol | P              | O              | T
DNS      | DNS Server     | Host           | Name resolution information
DHCP     | DHCP Server    | Host           | IP address, gateway address, DNS address, etc.
ARP      | Host/Gateway   | Host/Gateway   | IP address:MAC address mapping
STP      | Switch         | Switch         | Bridge priority, path cost
         | Network Device | Network Device | Addresses, device information


Table 2: Enumerating Services Involved in a Network’s “Deception Surface”. The protocols here form the main deception
surface of a network; we list them in the context of our data model’s trust relationship syntax. Even though there
are “secure” variants of some of these protocols, networks do not always use them because requiring authentication
infrastructure in order to establish basic layer 2 and 3 connectivity can be cumbersome and difficult to maintain.


The definition of trustworthiness for ARP should state 
that all hosts respond to queries for their IP address with 
their own MAC address, and gateways also respond to 
queries for IP addresses outside their network segment 
with their own MAC address. Any deviation from this
model could indicate that a deception is occurring.


Merely looking at ARP replies in isolation may not be 
sufficient. Consider the following scanning strategy. A 
prober conducts an ARP scan of a given set of addresses, 
and for each address scanned it does two things. First, 
it listens for replies and raises an alert if it hears more 
than one different MAC address in response. Second, if 
only one response is received, it saves that response to a 
hashtable. It then checks for one of two conditions.
The scan can either look for the address it has just heard
appearing in the hashtable twice (a duplicate), or look
for it to be absent from the hashtable.


The prober needs to run this scan against both its local 
network (excluding the gateway) and against addresses 
outside its network. The former warns the prober of un- 
trustworthy ARP behavior of hosts and servers on its own 
segment, and the latter of such behavior associated with 
its gateway. The scan needs to look for both the presence 
of duplicate addresses and their absence for this reason: 
all non-local addresses should resolve to the same ad- 
dress, which should not have been seen for any local ad- 
dress. 


If we do not observe this, we know that we have traf- 
fic intended for multiple IP addresses going to the same 
device on the network. This falls outside the definition 
of trustworthy ARP behavior, and the prober can raise an 
alert. We could run forwarding detection against the non- 
gateway IPs which returned the duplicate MAC, but it is 
not strictly necessary. Probes would take the form of a 
simple ARP scan, with a supporting hashtable. The tech- 
nique employs a brute-force approach, but should suc- 
cessfully detect ARP issues on the network. 
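
The scanning strategy can be sketched as follows. This is a minimal illustration that works on replies already collected by an ARP scan; the function name check_arp_scan, the input layout, and the example addresses are hypothetical, and the scan itself is omitted.

```python
def check_arp_scan(local_replies, remote_replies):
    """Return alerts for the results of an ARP scan.

    Each argument maps a probed IP address to the list of MAC addresses
    heard in reply; remote_replies covers addresses outside the segment.
    """
    alerts = []
    seen = {}  # hashtable: MAC address -> first local IP that answered with it

    for ip, macs in local_replies.items():
        distinct = sorted(set(macs))
        if len(distinct) > 1:
            alerts.append(f"{ip}: conflicting replies {distinct}")
        elif distinct:
            mac = distinct[0]
            if mac in seen:   # duplicate: two local IPs map to one device
                alerts.append(f"{ip} and {seen[mac]} both resolve to {mac}")
            else:
                seen[mac] = ip

    # All non-local addresses should resolve to one MAC (the gateway's),
    # and that MAC should not have been seen for any local address.
    gateway_macs = sorted({m for macs in remote_replies.values() for m in macs})
    if len(gateway_macs) > 1:
        alerts.append(f"non-local addresses resolve to several MACs: {gateway_macs}")
    for mac in gateway_macs:
        if mac in seen:
            alerts.append(f"gateway MAC {mac} also answers for local address {seen[mac]}")
    return alerts

local = {
    "10.0.0.5": ["aa:aa:aa:aa:aa:05"],
    "10.0.0.6": ["aa:aa:aa:aa:aa:06"],
    "10.0.0.7": ["aa:aa:aa:aa:aa:05"],
}
remote = {
    "192.0.2.1": ["aa:aa:aa:aa:aa:01"],
    "198.51.100.1": ["aa:aa:aa:aa:aa:01"],
}
alerts = check_arp_scan(local, remote)
```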


7.3 Limitations 


Although the probing approach we discuss is meant to 
serve as a kind of “tripwire for trust,” it has several
shortcomings. Of particular interest going forward is 
the consideration of how to scale the process of pro- 
ducing a verification plan and the concomitant probes 
to very large networks (along with large networks con- 
taining non-TCP/IP networking equipment). In a sense, 
the manual nature of writing probing scripts both helps 
and hinders the ability to scale. On one hand, writing 
scripts for a small number of critical pieces of network 
infrastructure benefits from the manual attention to detail 
and the knowledge of the system administrator about the 
quirks or peculiarities of the system being probed. On 
the other hand, in a highly heterogeneous environment 
containing a network composed over years from a vari- 
ety of organizations, the sheer diversity of core services 
poses a significant challenge. 

One way to deal with this challenge is to focus on de- 
tecting the presence of certain types of deceptions rather 
than verifying the behavior of every last system. An- 
other (complementary) approach would require research 
that can attempt to generate a set of probes from pristine 
(or trusted) configuration files and/or binary code of the 
target service. 

The stimulus-response pattern for detecting untrust- 
worthy behavior may not apply well to protocols that are 
not purely request-response based (e.g., they may operate 
on a stream of asynchronous update messages). We can 
attempt to verify the behavior of such services through 
trust relationships and cross-layer data (for example, for 
a routing protocol we might spoof or issue route with- 
drawals or announcements from one peer and see if the 
target announces such messages to another peer). 

Finally, we have not explicitly considered the effect 
the use of such active probing might have on IDSs extant 
in the target environment. It is likely that certain types of 
IDS might alert on messages from the prober, especially 
if they are malformed in some fashion. Dangers here include
the IDS increasing its alert logging (and thereby
increasing the noise in its alert stream or logs) as well as
subtly changing its view of the network. In general, a
coordinated security response from multiple independent
security mechanisms is a hard unsolved problem. Nevertheless,
one of our primary use cases is in a network
that we are attempting to recover; we might expect to
ignore the secondary effects of such probing in favor of
re-establishing core critical services.


8 Conclusion


This paper suggests that active intrusion detection (AID) 
techniques hold promise as a useful network security pat- 
tern, particularly when attempting to verify that basic 
constraints or characteristics of the network hold true. 
We presented an approach to AID based on probing: issuing
crafted packets meant to elicit a particular type of
response from the target system or host.

There are several conceptual lessons to take away from 
this work. Our main approach is predicated on probing 
the “corner case” behavior and configurations of network 
services and verifying that services return known-good
answers. Our main assumption is that normal behavior is 
in some sense equivalent to “trustworthy.” Feeding a sys- 
tem crafted input meant to exercise corner cases in logic 
or configuration serves as a good heuristic for revealing 
behavior that might carry highly individualized informa- 
tion. We hypothesize that meaningful differences in the 
characteristics of network trust relationships can reveal 
malicious influence (or at least a bug or misconfigura- 
tion). 

We suggested three patterns for building verifica- 
tion plans and exploring this space of varied behavior: 
stimulus-response, cross-layer data, and trust relation- 
ships. This approach can help users, client hosts, and sys- 
tem administrators verify the trustworthiness of network 
services, especially in the absence of strong authentica- 
tion mechanisms at layer 2 and 3. We discussed how to 
apply this method to DNS and ARP, we crafted packets 
that can remotely detect a host in forwarding mode, and 
we implemented a Scapy-based prototype to verify the 
trustworthiness of a DHCP service.


Abstract: Detection and remediation of security incidents (e.g.,
attacks, compromised machines, policy violations) is an increas- 
ingly important task of system administrators. While numerous 
tools and techniques are available (e.g., Snort, nmap, netflow), 
novel attacks and low-grade events may still be hard to detect 
in a timely manner. In this paper, we present a novel approach 
for detecting stealthy, low-grade security incidents by utilizing 
information across a community of organizations (e.g., banking 
industry, energy generation and distribution industry, govern- 
mental organizations in a specific country, etc). The approach 
uses netflow, a commonly available non-intrusive data source, 
analyzes communication to/from the community, and alerts the 
community members when suspicious activity is detected. A 
community-based detection has the ability to detect incidents that 
would fall below local detection thresholds while maintaining the 
number of alerts at a manageable level for each day. 


I. INTRODUCTION 


Detection and remediation of security incidents (e.g., at- 
tacks, compromised machines, policy violations) is an increas- 
ingly important task of system administrators. While numerous 
tools and techniques are available, novel attacks and low- 
grade security events may still be hard to detect in a timely 
manner. Specifically, system administrators typically have to 
base their actions on observing the local traffic to and from 
their own networks as well as global security incident alerts 
from organizations such as SEI CERT¹, Arbor Atlas², or
software and hardware vendors. However, stealthy targeted
attacks may slip below detection thresholds both in the local
data alone and on the global scale.

Furthermore, the nature of internet-based attacks is changing 
from random hacking to financially or politically motivated 
attacks. For example, botnets are increasingly leased out to 
highest bidders and DDoS attacks are often used as a means for 
blackmail. Moreover, attacks targeting industries with financial
information (e-commerce, banking, gaming, insurance) are
increasing, and SCADA (supervisory control and data
acquisition) systems in electrical power generation,
transmission, and distribution (among other industrial
process control systems) are even considered a potential target
for terrorism [10].


¹http://www.cert.org/
²http://atlas.arbor.net/


Targeted attacks might not leave a large traffic footprint in
the targeted organization since one machine with access to
the desired information or control system may be sufficient
for the attacker to achieve their goals. It is often difficult to
detect such low-footprint attacks based on local monitoring
alone because it is often necessary to set local alerting thresholds
high enough not to generate too many false positives
and overwhelm the system administrators. But as a result, a
stealthy attack or compromise may lie undetected. Therefore,
it is possible for an attacker to target many such organizations 
without being detected. For example, the attacker may want to 
maximize profit by attacking multiple financial organizations 
concurrently before the vulnerability used is detected and 
corrected. Similarly, terrorists may require the control of many 
companies to achieve their goal of large scale damage. 

In this paper, we present a novel approach for detecting 
stealthy, low-grade security incidents by utilizing information 
across a community of organizations (e.g., banking industry, 
energy generation and distribution industry). We will show 
by using an example that we can find possible attacks (or 
attempts) that only transfer very little data (e.g., a few bytes) 
and thus would remain undetected by conventional approaches. 

The remainder of this paper is structured as follows. In 
Section II, we present the technical approach based on netflow 
data and construction of communities of interest. Section III 
describes the implementation of the system, including the 
algorithms used for the analysis. We evaluate the performance
of our system in Section IV and present selected case studies
of suspicious activity we have identified in Section V.
Section VI outlines related work in the area and Section VII 
concludes the paper. 


II. APPROACH 
A. Service vision 


Our technique is based on the concept of community, in our
case defined as a collection of (at least two) organizations. A
community can be specified based on any criteria relevant for
attack detection. For example, it could consist of businesses in
a particular industry (e.g., banking, health care, insurance, etc),
organizations within a country (e.g., businesses and government
agencies in one country), or organizations with a particular
type of valuable information (e.g., information targeted by
industrial espionage or customer credit card information).
We detect stealthy security
attacks by observing the communication to/from the member
organizations of a community. The intuition is that within
each organization only very few machines may be attacked
or compromised and, as a result, an attack can be very hard
to detect within each organization. However, by observing the
communication behavior across multiple organizations in the
community, such stealthy behavior may become visible.

Given that we analyze communication in the Internet, each 
organization is defined by the list or range of IP addresses 
belonging to the organization. We consider Internet communication
connections (reported by netflow, for example) within
the communities and between communities and external IP
addresses that do not belong to any community. For our
analysis, all the IP addresses within an organization can be
collapsed into one identifier representing the organization.
Any communication between two IP addresses where neither
belongs to one of our communities and neither has communicated
with a community in the past can be ignored.
Furthermore, communication with IP addresses belonging to
commonly used Internet services (e.g., search, news, social
media) can be whitelisted and removed from consideration.

We construct a communication graph for each IP address 
that communicates with at least one organization in a com- 
munity as illustrated in Figure 1. This figure shows the 
communication graph for an external IP address (i.e., some IP
address outside any of the communities of interest). This node 
has communicated with two communities, one consisting of 
organizations 7 and 8, and the other consisting of organizations 
1 through 6. A directed edge from some node A to some other 
node B in the graph indicates that A has sent messages to B. 
Although not depicted in the figure, each edge may contain 
additional information, such as the combinations of source and 
destination ports used. 

The weight of the edge is used to quantify the importance 
of the communication. The importance can be based simply 
on the number of messages or bytes sent, or the number 
of contacted individual members in the targeted organization. 
However, some communication may be more important than
others from a security point of view. For example, some port
numbers are more often involved with malicious activity (e.g., 
based on CERT reports) and communication using such ports 
can be weighted more heavily. 

The weight is also used to limit the size of each graph. The 
size of the graph is determined by the number of nodes it 
contains. If the size exceeds a given threshold, we remove the 
weakest links until the threshold is reached. This is necessary 
because storing all communications would require too much 
space even for a single day. For example, in our data set con- 
sisting of heavily sampled netflow, a given weekday contains 
about 860 million entries. These 860 million recorded netflows 
originate from 28 million distinct IP addresses. Therefore, if we
did not filter unimportant IP addresses, we would need to
store 28 million graphs. Moreover, each of these 28 million IP
addresses often connects with 1 to 2 million other IP addresses.


Fig. 2: Communication graph for a community member


Thus, if we did not limit the size of each graph, we would 
have some graphs that are too large to fit into memory. The 
situation would be even more challenging if we analyzed the 
data for one month or a week instead of the current one day 
at a time. 
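
The size limit can be sketched as follows. This is a minimal illustration, assuming a star-shaped graph in which every edge from the central IP address adds exactly one node; the function name prune_graph, the dictionary layout, and the example weights are hypothetical.

```python
import heapq

def prune_graph(edges, max_nodes):
    """Keep only the max_nodes strongest edges of one communication graph.

    edges maps a neighbouring node to the weight of the edge leading to it.
    """
    if len(edges) <= max_nodes:
        return dict(edges)
    # Dropping the weakest links first is the same as keeping the strongest.
    strongest = heapq.nlargest(max_nodes, edges.items(), key=lambda kv: kv[1])
    return dict(strongest)

graph = {"org1": 40.0, "org2": 0.5, "org3": 12.0, "org4": 3.0}
pruned = prune_graph(graph, 2)
```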

As already stated, we also consider communication within a 
community and across communities. With that, we are able to 
detect already compromised computers inside an organization 
when they try to attack further organizations as shown in 
Figure 2. To reduce the number of false positives (many 
organizations have frequent contact with other organizations 
of the same or other communities), a computer inside an 
organization that belongs to a community (or is contained in 
the whitelist) has to show more suspicious behavior than an 
external IP address before an alarm is generated. For example,
we do not consider communication via port 443 within or across
communities.

Given such communication graphs, a potential security 
incident is suspected when an IP address communicates with 
a specified number of community members. Typical examples 
of security threats that can be detected using this approach 
include botnet controllers managing a number of bots in the
community, compromised machines downloading stolen information
on a dedicated server, an attacker targeting machines
in multiple organizations, as well as many security policy 
violations (e.g., illegal software download sites, etc). The 
number of alarms can be controlled using thresholds and the 
system can memorize IP addresses that have already been 
reported recently. When there are false positives, the system 
administrators can extend the whitelist. 

An IP address may contact a large number of community
members either because the community is actually targeted
or because the attacker is targeting all or most of the Internet
(e.g., a broad port scan). The system administrators may want to react
differently to these alternative scenarios. Therefore, for each IP 
address that has contacted a community member, our system 
keeps track of how many times it has communicated with IP 
addresses outside our communities of interest. 
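
The alerting condition, together with the count of contacts outside the community, can be sketched as follows. This is a minimal illustration; the function name suspicious_sources, the flow representation, and the example addresses and organization names are hypothetical, and edge weights, whitelisting, and the memory of recently reported addresses are left out.

```python
from collections import defaultdict

def suspicious_sources(flows, org_of, min_members):
    """Report sources that contacted at least min_members organizations.

    flows:  iterable of (src_ip, dst_ip) pairs
    org_of: dict mapping a community IP address to its organization id
    Returns {src_ip: (organizations contacted, contacts outside the community)}.
    """
    members = defaultdict(set)   # src -> community organizations contacted
    external = defaultdict(int)  # src -> contacts outside the community
    for src, dst in flows:
        org = org_of.get(dst)
        if org is None:
            external[src] += 1
        else:
            members[src].add(org)
    return {src: (len(orgs), external[src])
            for src, orgs in members.items() if len(orgs) >= min_members}

org_of = {"10.1.0.5": "bank1", "10.2.0.9": "bank2", "10.3.0.1": "bank3"}
flows = [
    ("203.0.113.7", "10.1.0.5"),
    ("203.0.113.7", "10.2.0.9"),
    ("203.0.113.7", "10.3.0.1"),
    ("203.0.113.7", "198.51.100.20"),
    ("198.51.100.9", "10.1.0.5"),
]
result = suspicious_sources(flows, org_of, min_members=3)
```

A low external count next to a high member count suggests that the community itself is the target, whereas a high external count points to a broad scan.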


B. Input data 


Our community-based alerting service uses netflow as its 
input data source (although other types of information could 
be utilized as well). Netflow is a standard data format collected 
and exported by most networking equipment, in particular, 
network routers. It provides summary information about each 
network communication passing through the network equipment.
Specifically, a network flow is defined as a unidirectional
sequence of packets that share source and destination IP
addresses, source and destination port numbers, and protocol 
(e.g., TCP or UDP). Each netflow record carries information 
about a network flow including the timestamp of the first 
packet received, duration, total number of packets and bytes, 
input and output interfaces, IP address of the next hop, 
source and destination IP masks, and cumulative TCP flags 
in the case of TCP flows. Note, however, that the netflow 
record does not contain any information about the contents 
of the communication between the source and destination IP 
addresses. 
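
The flow definition can be illustrated with a small sketch that groups packets by their five-tuple. This is an assumption-laden illustration of the concept only: the names FlowKey and aggregate_flows and the packet tuples are hypothetical, and real netflow records carry further fields (timestamps, interfaces, TCP flags) that are omitted here.

```python
from collections import namedtuple

FlowKey = namedtuple("FlowKey", "src_ip dst_ip src_port dst_port protocol")

def aggregate_flows(packets):
    """Group packets into unidirectional flows.

    packets: iterable of (src_ip, dst_ip, src_port, dst_port, protocol, size)
    Returns {FlowKey: (packet count, byte count)}.
    """
    flows = {}
    for *key_fields, size in packets:
        key = FlowKey(*key_fields)
        count, total = flows.get(key, (0, 0))
        flows[key] = (count + 1, total + size)
    return flows

packets = [
    ("192.0.2.1", "198.51.100.2", 40000, 80, "TCP", 60),
    ("192.0.2.1", "198.51.100.2", 40000, 80, "TCP", 1500),
    ("198.51.100.2", "192.0.2.1", 80, 40000, "TCP", 52),  # reverse direction
]
flows = aggregate_flows(packets)
```

Because flows are unidirectional, the reply packet in the example forms a second flow rather than joining the first.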

The community-based alerting service requires access to 
netflow to/from each of the organizations in the community. 
Such data can be collected by each of the organizations in the
community at their edge routers and then aggregated at a central
location for processing. Alternatively, it can be provided by
an ISP that serves a number of the organizations in the 
community. Note that the netflow data may be sampled (to 
reduce the volume of the data) and the actual IP addresses 
of the computers within each organization can be obfuscated 
prior to the analysis (e.g., all IP addresses belonging to an 
organization can be collapsed into one address) if desired. 

Given the collected netflows and the IP address ranges 
belonging to each member organization in the community, 
our alerting service analyzes the data (either in real time or in
daily or hourly batches) and generates alerts to the system 
administrators. The analysis algorithm is described in Sec- 
tion III. A whitelist can be used to eliminate any legitimate 
communication destinations from consideration (e.g., search 
engines, CDNs, banking, on-line retailers, etc).


III. IMPLEMENTATION


A. Architecture 


The architecture of our system is presented in Figure 3. We 
use three different types of processing components that do not 
share any state and are executed as individual processes: the 
parse, the filter, and the graph components. Each component 
can be replicated and executed by any number of processes 
(e.g., L, M, N1, and N2 in the figure). Every process of every
component has a unique id (from 0 to the number of processes
of that component minus 1) that is used for message routing. Since
the parse component is connected to the filter component, each
parse process is connected to each filter process. The same is
true for the filter and graph components. Note that the system
supports multiple different kinds of graph components in one
system configuration as illustrated by Graph 1 and Graph 2 in
the figure. Different graph components can be used to realize
different alerting conditions as we will describe below.

The communication between components is based on event 
messages that are sent via TCP-channels. A message consists 
of a key and a body that are defined by the pair of interacting 
components (e.g., parse and filter, or filter and graph) and may 
contain any information desired by these components. For the 
key, a hash function f must be available that maps the contents 
of a key into an unsigned integer, which is used to route the 
event message to the right receiving process. For example, if a 
parse process is connected to 2 filter processes (i.e., M = 2),
the receiving filter process is chosen by calculating the modulo 
of the hash of the key and 2. Thus, in this particular example, 
all keys with even hashes would be routed to the first and all 
keys with odd hashes to the second filter process. 
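
The routing rule can be sketched in a few lines. In this illustration zlib.crc32 merely stands in for the hash function f, which the text does not specify further; the function name route and the byte-string key are likewise assumptions.

```python
import zlib

def route(key: bytes, num_receivers: int) -> int:
    """Return the id (0 .. num_receivers - 1) of the receiving process."""
    return zlib.crc32(key) % num_receivers

# With M = 2 filter processes, every key lands on process 0 or 1, and the
# same key always lands on the same process.
receiver = route(b"192.0.2.1", 2)
```

Routing by key rather than round-robin means that all messages for one key reach the same process, which is what allows the per-key state to stay local to it.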

The internal state maintained by each component is partitioned
by the same key, making it possible to distribute their
processing load onto multiple cores efficiently.


Fig. 3: Data processing architecture


Each network flow is processed as follows. First, the netflow
data is read from a local storage device (it could also be received
in real time from a router). The parse component transforms
the IP addresses from their original string representation
(i.e., “AAA.BBB.CCC.DDD”) into an integer representing the
IP address³ and constructs a message with 5 fields: sourceIP,
source-port, destinationIP, destination-port, and transferred- 
bytes. The parse component sends this message to the filter 
component. It uses the sourceIP field of the message as the 
key. The filter component either forwards (using the same key) 
or discards the received message. This decision is based on
various factors, such as the ports used and the source and destination IPs.
If the message is forwarded, it is forwarded to one process of 
every graph component (e.g., Graph 1 and Graph 2). Finally, 
the graph components construct a community graph for each 
source IP. The filtering and community graph construction are 
described in detail below. 
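
The parse step can be sketched with Python's standard ipaddress module. This is a minimal illustration, not our implementation; the function name parse_record and the example values are hypothetical.

```python
import ipaddress

def parse_record(src_ip, src_port, dst_ip, dst_port, transferred_bytes):
    """Build the five-field message and return it with its routing key."""
    message = (
        int(ipaddress.IPv4Address(src_ip)),  # sourceIP as an integer
        src_port,
        int(ipaddress.IPv4Address(dst_ip)),  # destinationIP as an integer
        dst_port,
        transferred_bytes,
    )
    return message[0], message  # the sourceIP field serves as the key

key, message = parse_record("141.1.0.0", 51000, "192.0.2.1", 80, 1200)
```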


B. Filtering 


The filter is an essential part of our analysis and its role is 
to remove irrelevant flow records and to reduce the amount 
of data that needs to be processed by the graph component. 
For example, commonly used search, news, social media, and 
entertainment web sites are used so frequently that they would 
appear with almost every community. Furthermore, any traffic 
that does not involve at least one community member is not 
relevant for the analysis and is filtered out. Other filtering 
actions can be chosen based on data volume and perceived 
threat vectors. For example, HTTP-traffic may be filtered to 
reduce data volume, but at the risk of missing attacks that use 
HTTP (port 80). 


Algorithm 1: Example Filter algorithm


input : (src-IP, src-port, dst-IP, dst-port, transferred-bytes)
output: The same as the input, if not filtered


//collapse IP addresses
src-IP, dst-IP = collapse(src-IP), collapse(dst-IP);
//filter IPs of commonly used web sites
if src-IP ∈ whitelist then
    return ∅;
end
//filter web-accesses to community-members
if dst-IP ∈ community then
    if src-IP ∈ community then
        if src-port = 80 then
            return ∅;
        end
    end
end
//only forward if one of the IPs is in the community
if dst-IP ∈ community OR src-IP ∈ community then
    return (src-IP, src-port, dst-IP, dst-port, transferred-bytes);
end


Algorithm 1 shows an example filter component that filters
connections based on their ports and their source and destination
IP addresses. First, the algorithm collapses the IP addresses of an
organization into one address. If, for example, an organization
has the IP range from 141.1.0.0 to 141.85.255.255 and
either the src-IP or the dst-IP is within this range, it is set
to 141.1.0.0. We then discard every connection from IP
addresses that are contained in the whitelist. Second, accesses
to a community member’s web-server are filtered. Finally,
we only forward the event message if at least one of the
connection end-points is contained in the community.


³We will continue calling this identifier an IP address to reinforce the
one-to-one connection between these numerical IDs and the IP addresses.
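
For illustration, the filter of Algorithm 1 can be written as runnable Python. This is a sketch under stated assumptions, not our implementation: it follows our reading of the algorithm's membership tests, the names make_collapse and filter_flow are hypothetical, IP addresses are assumed to be integers as produced by the parse component, and None stands for a filtered message.

```python
import ipaddress

def make_collapse(org_ranges):
    """org_ranges: list of (first_ip, last_ip) strings, one per organization."""
    ranges = [(int(ipaddress.IPv4Address(lo)), int(ipaddress.IPv4Address(hi)))
              for lo, hi in org_ranges]

    def collapse(ip):
        for lo, hi in ranges:
            if lo <= ip <= hi:
                return lo  # the first address represents the organization
        return ip

    return collapse

def filter_flow(msg, collapse, whitelist, community):
    """Return the collapsed message, or None if it is filtered out."""
    src_ip, src_port, dst_ip, dst_port, nbytes = msg
    src_ip, dst_ip = collapse(src_ip), collapse(dst_ip)
    if src_ip in whitelist:  # commonly used web sites
        return None
    if dst_ip in community and src_ip in community and src_port == 80:
        return None          # web traffic involving community members
    if dst_ip in community or src_ip in community:
        return (src_ip, src_port, dst_ip, dst_port, nbytes)
    return None              # no community member involved

def to_int(address):
    return int(ipaddress.IPv4Address(address))

collapse = make_collapse([("141.1.0.0", "141.85.255.255")])
community = {to_int("141.1.0.0")}
whitelist = {to_int("198.51.100.1")}
kept = filter_flow((to_int("203.0.113.7"), 40000, to_int("141.20.3.4"), 22, 64),
                   collapse, whitelist, community)
```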


C. Community Graph 


We build a fixed size (K) Community of Interest (COI)
graph for each IP address that is received by the graph 
component. Essentially, we use a windowed top-K algorithm, 
as described in [3]. However, there are two significant dif- 
ferences in our implementation compared to [3]. First, our 
window is not based on a fixed time interval, but rather on 
the observed connections. This has the benefit that the COIs 
of IP addresses with many connections will be updated more 
often than of those with very few. Second, we introduce 
several COI views ({V1, ..., Vn}) that use different methods
to determine the weight of a connection. We can, for example, 
favor connections that transfer many bytes over those that 
only transfer a few by using the transferred bytes as the 
edge weight. Obviously, in this case we would not be able 
to detect attacks that transfer only a small set of data if these 
connections are dominated by large file transfers. Therefore, 
we define another view that uses the port numbers involved in 
security incidents to weight the edges (i.e., the more reported
security incidents for a port, the larger the weight). Our system
supports any number of such views running in parallel, as
depicted in Figure 3 (with Graph 1 implementing a different
view than Graph 2).
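To make the idea of views concrete, the following sketch shows two weight functions over the same window entry. It is an assumption-laden illustration: the entry layout (a dict with transferred_bytes and a port_map keyed by port pairs), the function names, and the incident counts per port are hypothetical and not taken from the paper.

```python
def view_bytes(entry):
    """Weight an edge by the number of bytes transferred."""
    return float(entry["transferred_bytes"])

def make_view_port_risk(incidents_per_port):
    """Build a view that weights an edge by reported incidents for the ports used.

    incidents_per_port maps a port number to a count of reported security
    incidents (hypothetical numbers, supplied by the caller).
    """
    def view(entry):
        weight = 0.0
        for (src_port, dst_port), count in entry["port_map"].items():
            risk = incidents_per_port.get(src_port, 0) + incidents_per_port.get(dst_port, 0)
            weight += count * risk
        return weight
    return view

# A tiny probe on a risky port versus a large transfer on unremarkable ports.
probe = {"transferred_bytes": 60, "port_map": {(1238, 445): 1}}
bulk = {"transferred_bytes": 5_000_000, "port_map": {(50312, 51877): 1}}
view_port_risk = make_view_port_risk({445: 120, 1433: 80})
```

Under the byte view the bulk transfer dominates, while under the port view only the probe receives a non-zero weight, which is the motivation for running several views side by side.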

Algorithm 2 shows how the COI is constructed in more 
detail. The algorithm uses two main data structures: a window 
that is used to collect recent data and a COI graph that stores 
the COI graph as seen from the beginning of the analysis run. 
We first add the received connection to the window. If more 
than 1000 connections have already been added, the window 
is merged with the COI graph. To this end, for each IP in
the window, the weight of each edge is calculated, multiplied
with a damping factor 1 − θ and added to the weight in the
COI, which is first multiplied with θ. Since θ = 0.85, the
influence of the new connections in the window is dampened.
We also merge the port-mapping per destination-IP. It maps the
source-port to the destination-port and a counter, counting how
often this port-combination was used. Thereafter, the weights
of all contacts in the COI that have not been observed during
the current window are decayed by multiplying them with θ.
To keep the COI at a maximum size of K, we remove the 
weakest links until the size of the COI is equal to K. Finally, 
the window and the counter are reset. 
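The window merge described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the class name CoiState, the dictionary layout, and the window_size parameter (1000 in the text, configurable here so the behavior is easy to try on small inputs) are assumptions, and θ = 0.85 is taken from the description.

```python
THETA = 0.85  # damping factor from the text

class CoiState:
    """Per-source-IP state: a window of recent contacts and a top-K COI."""

    def __init__(self, k, view, window_size=1000):
        self.k = k                      # maximum number of contacts kept in the COI
        self.view = view                # function: window entry -> edge weight
        self.window_size = window_size
        self.window = {}
        self.topk = {}
        self.counter = 0

    def add(self, dst_ip, src_port, dst_port, nbytes):
        """Record one connection; merge once more than window_size were added."""
        entry = self.window.setdefault(dst_ip, {"transferred_bytes": 0, "port_map": {}})
        entry["transferred_bytes"] += nbytes
        ports = (src_port, dst_port)
        entry["port_map"][ports] = entry["port_map"].get(ports, 0) + 1
        self.counter += 1
        if self.counter > self.window_size:
            self.merge()

    def merge(self):
        # Blend the window weight into the long-term weight and merge port maps.
        for ip, entry in self.window.items():
            contact = self.topk.setdefault(ip, {"weight": 0.0, "port_map": {}})
            contact["weight"] = (1 - THETA) * self.view(entry) + THETA * contact["weight"]
            for ports, count in entry["port_map"].items():
                contact["port_map"][ports] = contact["port_map"].get(ports, 0) + count
        # Decay contacts that were not observed in this window.
        for ip, contact in self.topk.items():
            if ip not in self.window:
                contact["weight"] *= THETA
        # Remove the weakest links until at most K contacts remain.
        while len(self.topk) > self.k:
            weakest = min(self.topk, key=lambda ip: self.topk[ip]["weight"])
            del self.topk[weakest]
        self.window = {}
        self.counter = 0
```

Because the merge is triggered by a connection count rather than a timer, a busy source IP refreshes its COI more often than a quiet one.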


D. Generating Alarms 


We showed above how the COI graph is constructed. Here, 
we provide two complementary algorithms to detect suspicious 
IP addresses. 

The first, shown in Algorithm 3, is used to pre-filter all IP 
addresses that belong to a community. However, if a computer 
inside the community is compromised, we still want it to be 
checked further. To this end, we iterate over all connections in
the IP’s top-K and check each pair of ports. The pairs of ports
considered suspicious are specified in a configuration file.


Algorithm 2: Example Community graph construction

input : (src-IP, src-port, dst-IP, dst-port, transferred-bytes),
        s = State[src-IP], F
output: None

//Save connection in window
s.window[dst-IP].transferred_bytes += transferred-bytes;
s.window[dst-IP].port_map[src-port][dst-port]++;
s.counter++;
//Merge window into topK after 1000 events
if s.counter > 1000 then
    foreach IP ∈ s.window do
        //θ has a value of 0.85 in our analysis.
        s.topk[IP].weight = (1 - θ) * V(s.window[IP])
                            + θ * s.topk[IP].weight;
        //Merge the window’s port map with the top-k’s
        foreach {source-port, dest-port} ∈ s.window[IP].port_map do
            s.topk[IP].port_map[source-port][dest-port] +=
                s.window[IP].port_map[source-port][dest-port];
        end
    end

    //Decay weight of old connections
    foreach IP ∉ s.window do
        s.topk[IP].weight = θ * s.topk[IP].weight;
    end

    //Remove the weakest links
    while size(s.topk) > K do
        remove_weakest_link_from(s.topk);
    end

    s.window = ∅;
    s.counter = 0;
end


We call Algorithm 4 for all IP addresses returned by 
Algorithm 3. It ensures that (1) only those IP addresses that
connected to at least min_cnt members of the community
will be reported and (2) that the connections to the community
make up at least min_part percent of all the connections of the
current IP address. 
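The two detection steps can be sketched together as below. This is a hedged illustration: the function names prefilter and raise_alarm, the top-K layout (contact IP mapped to a port_map keyed by port pairs), and the is_suspicious callback are assumptions. The share of community contacts is treated as a fraction between 0 and 1, as in the division shown in Algorithm 4, rather than as a percentage.

```python
def prefilter(ip, community, blacklist, topk, is_suspicious):
    """First step: return ip if it needs further checking, else None."""
    if ip in blacklist:
        return ip                      # blacklisted IPs are always suspicious
    if ip in community:
        # Community members are only checked further if they use suspicious port pairs.
        for contact in topk.values():
            for src_port, dst_port in contact["port_map"]:
                if is_suspicious(src_port, dst_port):
                    return ip
        return None
    return ip                          # not in the community: check further

def raise_alarm(ip, community, blacklist, topk, min_cnt, min_part):
    """Second step: report ip only if enough of its top-K contacts are community members."""
    cnt = sum(1 for contact_ip in topk if contact_ip in community)
    part = cnt / len(topk) if topk else 0.0
    if ip not in blacklist and (cnt < min_cnt or part < min_part):
        return False
    return True
```

A caller would first apply prefilter to every IP and then pass the survivors to raise_alarm with thresholds chosen for the community at hand.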

The detection algorithm can be run either for all IP ad- 
dresses at once or individually for each IP address. Therefore, 
it is possible to provide different detection latencies. For 
example, to detect a suspicious IP address as early as possible,
the algorithm must be executed as soon as a message is 
received for its source-IP’s top-K. If this is not necessary, the 
algorithm can be run for all top-Ks in one graph process at 
any desired interval. 

The generated alarms can be emailed to the system admin- 
istrators in the affected organizations or posted on a security 
dashboard. The reports contain the complete top-K for each 
suspicious IP address, including the port mappings. 


IV. EVALUATION 
A. Input data and general setup 


We currently run the experiment on a per-day basis. This 
means we fetch the netflow entries of the last 24 hours and
run our analysis. We do not carry any state from one daily
run to the next. In principle, we could leave the system
running continuously or checkpoint the graph component and
re-initiate its state on the next day. However, we found it useful
to start with a clean system every day since this makes it easier
to reason about the impact of changes in the community and
white lists.


Algorithm 3: Suspicious IP detection (1)

input : IP, community, s = State[IP]
output: IP, if suspicious; ∅, if not

//blacklisted IPs are always suspicious
if IP ∈ blacklist then
    return IP;
end

//check if IP is in the community
if IP ∈ community then
    //iterate over all of IP’s connections
    foreach conn ∈ s.topk do
        //iterate over all ports of one connection
        foreach p ∈ s.topk[conn].port_map do
            //check if src_port and dst_port are suspicious
            if is_suspicious(src_port, dst_port) then
                return IP;
            end
        end
    end

    //no strange ports -> skip
    return ∅;
end

//not in community -> check
return IP;


Algorithm 4: Suspicious IP detection (2)

input : IP, community, min_cnt, min_part, s = State[IP]
output: Alarm

//check if top-K connections of this IP are in the community often enough
cnt = count_community(community, s.topk);
part = cnt / size(s.topk);

if IP ∉ blacklist then
    if cnt < min_cnt OR part < min_part then
        return false;
    end
end

return true;

Moreover, we introduced the concept of different views in 
June 2011. Since then, we have used three different views: one that
weighs the bytes transferred, another that weighs the number
of connections made, and the last one that weighs the security
risk for the ports used (as described in Section III). For any
measurements that were conducted before this date, we only 
used the view based on the bytes transferred. 

Our input data-set is heavily sampled netflow from an ISP. 
In the first step, we remove all unimportant fields, leaving only 
the source-IP, destination-IP, source-port, destination-port, and 
the number of transferred bytes. This sums up to roughly
50GB of processed netflow per day.


Fig. 4: Processed netflow-entries per second over the complete analysis execution

The community lists define a community with the IP address 
ranges of all its members and each community is stored in a 
separate file (the white list is simply a “special” community).
For example, if we wanted to add “TU-Dresden” to a “univer- 
sities community” we would add the following line into the 
corresponding file: 


141.1.0.0 - 141.85.255.255 TU-DRESDEN, DE


If a company or institution has more than one IP address 
range assigned, we can simply add each range as a separate 
entry. Moreover, an entry in one community is allowed to be 
a member in other communities as well. 
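A community file of this kind could be read with a few lines of Python. The sketch assumes a line format of first address, a dash, last address, and a free-text label; the exact syntax of the original files is not fully recoverable here, so the format, the function names, and the second example entry (taken from a documentation address range) are hypothetical.

```python
import ipaddress

def parse_community_line(line):
    """Parse '<first IP> - <last IP> <label>' into (first, last, label)."""
    first, dash, last, label = line.split(None, 3)
    if dash != "-":
        raise ValueError(f"expected '<first> - <last> <label>', got: {line!r}")
    return ipaddress.ip_address(first), ipaddress.ip_address(last), label.strip()

def load_community(lines):
    """Parse all non-empty lines; one organization may contribute several ranges."""
    return [parse_community_line(line) for line in lines if line.strip()]

def lookup(ip, entries):
    """Return the label of the first range containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    for first, last, label in entries:
        if first <= addr <= last:
            return label
    return None

universities = load_community([
    "141.1.0.0 - 141.85.255.255 TU-DRESDEN, DE",
    "",
    "192.0.2.0 - 192.0.2.255 EXAMPLE-UNIVERSITY, XX",  # hypothetical entry
])
```

Loading each community file into its own list keeps overlapping memberships simple, since the same range can appear in several files.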


B. Performance 


We implemented the parse, filter, and graph components on 
top of StreamMine [12], a highly scalable stream processing 
system. While StreamMine supports scaling to hundreds of 
physical machines, a scalability and performance evaluation 
involving multiple machines is out of the scope of this paper. 
Therefore, we only used a single machine with 24GB of 
RAM and 16 processing cores for the analysis. For the top-K 
algorithm we used a value of 100 for K. 

Figure 4 shows the read-throughput of the parse component 
of one such run in which we processed one day of netflow 
data (using only one view). The measurement was taken every 
second throughout the whole run. The parse component can 
read around 400,000 netflow entries per second with this single 
machine. Each entry is converted into a message and sent to 
the filter component. The filter component discards a large
fraction of these messages and only sends around one in a
hundred of the incoming messages to the graph component.
Naturally, the read throughput varies over time, since the 
amount of processing that needs to be done in the system 
depends heavily on the content of the input data. However, it 
is important to note that the mean throughput stays constant,
i.e., the system performance does not decline with time as
more graphs are added. 

In the experiments reported in this paper, the filter com- 
ponent uses 13 of the available cores, since it has to filter 
the 400,000 netflow entries arriving every second. The graph 
component uses only one core since the amount of data it has 
to process is only a fraction of the data the filter receives. Note
that even if one assigned more processing resources (i.e.,
cores) to the graph component, it would still be impossible
to process unfiltered traffic (i.e., a system without the filter
component): the system would simply run out of memory.
The parse component uses the remaining two cores for reading 
the input files and parsing their contents. 

To avoid queuing, StreamMine uses the TCP back-pressure
mechanism on the network-connections. Hence, if a message 
cannot be processed by the filter component because all its 
threads are already busy processing other messages, the parse 
component will eventually stop sending new messages (the 
TCP send blocks if messages are not read fast enough on the 
other side). This will eventually lead to the parse component 
not reading any new netflow entries, because all its threads 
are blocked trying to send messages. 
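The effect of back-pressure can be illustrated with an in-process analogue. The sketch below does not use TCP or StreamMine: a bounded queue.Queue stands in for the socket buffer, and the names run_pipeline, parse and filt are hypothetical. The point it shows is that a blocking send limits how far the producer can run ahead of the consumer.

```python
import queue
import threading

def run_pipeline(n_items, capacity):
    """Push n_items from a 'parse' thread to a 'filter' thread through a bounded queue."""
    channel = queue.Queue(maxsize=capacity)  # stands in for the TCP send buffer
    received = []
    max_depth = 0

    def parse():
        for item in range(n_items):
            channel.put(item)      # blocks while the queue is full (back-pressure)
        channel.put(None)          # sentinel: no more input

    def filt():
        nonlocal max_depth
        while True:
            max_depth = max(max_depth, channel.qsize())
            item = channel.get()
            if item is None:
                break
            received.append(item)

    producer = threading.Thread(target=parse)
    consumer = threading.Thread(target=filt)
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return received, max_depth
```

However slow the consumer is, the number of in-flight messages never exceeds the chosen capacity, so memory use stays bounded at the cost of stalling the reader.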

Fig. 5: Community and alarm sizes over time


Figure 5 shows the size of the daily alarm report (=
number of suspicious IPs communicating with the community)
and community sizes (approximately the number of member
organizations) over time for several months. The size of the
alarm report is subject to a weekly pattern with larger sizes
for weekdays (for alarm reports produced from Tuesday to
Saturday) and smaller for weekend traffic. The community
lists and the white list were updated manually on a daily basis. 
Given a fixed community, the community list would typically 
stay relatively fixed but in our case we occasionally identified 
additional community members. For the alarm reports, we only 
plot the report sizes for communities E and C. We did not 
generate reports for the other communities because (1) we 
found E and C to be the most interesting ones and (2) because 
of time-constraints as we need to scan the reports manually for 
attacks and new members of the community or white lists. It is 
natural that the reports, especially initially, contain a number of 
false positives. Some of them will be new community members 
that have to be added to the community list, while others are 
companies and organizations that can be added to the white 
list. The white list is used to filter out trusted traffic, i.e., from
well known search engines, entertainment web sites, social 
media, popular CDNs, banking, government services, etc. 


In an actual usage of the system, the system administrators 
analyzing the alarm reports would also add other known 
“good” IP addresses to the white list to prevent them from 
being reported daily. Lacking such domain knowledge, our 
experiments used the white list conservatively. The bottom plot
approximates the size of the daily alarm report under a real usage
scenario where suspected IP addresses are processed daily
and either added to the white list or the suspect communication
is stopped (e.g., clean up infected machine, add firewall rules).
This alarm size is approximated simply by only listing the IP
addresses that have not been reported before.
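This approximation amounts to a running set difference over the daily reports. A minimal sketch, assuming each daily report is simply a collection of IP strings (the function name and input layout are hypothetical):

```python
def new_alarm_sizes(daily_reports):
    """For each day, count the reported IPs that were never reported on an earlier day."""
    seen = set()
    sizes = []
    for report in daily_reports:
        fresh = set(report) - seen
        sizes.append(len(fresh))
        seen |= fresh
    return sizes
```

For example, three daily reports containing {a, b}, {b, c} and {a, c} give sizes 2, 1 and 0, since every address on the third day has been seen before.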


V. CASE STUDIES 


While we do not typically know the ground truth, we have 
observed a number of suspicious cases in our analysis. In this 
section, we outline some of these examples. 


A. Case 1 


Table I shows an anonymized part of the report, generated 
for the netflow on May 13th, 2011. The report was obtained 
using the view based on the number of bytes transferred. It 
depicts the anonymized source-IP address (X.Y.Z.W) and the 
communities it was connected to, which ports were used (to 
help identify the application or service used), and a measure
of the frequency of communication: the “Occurrences” field
indicates how often this connection was observed in the COI.
In the actual report, the IP address and the exact community 
member are visible, of course. 

IP Address   Src Port   Community   Dst Port   Occurrences
X.Y.Z.W      6000       E           1433       2
X.Y.Z.W      6000       E           1433       2
X.Y.Z.W      6000       E           1433       1
X.Y.Z.W      6000       E,C         1433       1
X.Y.Z.W      6000       B           1433       1
X.Y.Z.W      6000       E,C         1433       1

TABLE I: Anonymized report-snippet (port-mapping) from
May 13th, 2011


In the next step, we usually use the whois service to
determine to whom the IP address belongs. This way, we may
also find new members of the community by looking up the
company names displayed in the whois information. For this
particular example, the only information we could get was
that it belongs to an Asian ISP. Since the IP address likely
does not belong to a company that the community members
would typically collaborate with, we take a closer look at
the ports being used. We assume that the lower port number
(1433) belongs to the server and the higher port number to the
client (6000). Figures 6 and 7 show the output of the “SANS
Internet Storm Center” web-site⁴ related to port 1433. The
web-site shows the services that usually run on these ports,
in this example “Microsoft-SQL-Server”. The SANS reports
indicate many potential vulnerabilities, which may be used,
for example, to steal data.

Unfortunately, this is usually everything we are able to 
derive from the netflow alone. While we consider this to be 
a potential attack, final certainty could only be provided by 
the system administrators of the individual companies, given 
they have deeper knowledge about legitimate communication 
connections of each organization and access to lower-level logs 
on the targeted machines. 


B. Case 2 


Table II shows a summary of the COI of another 
anonymized IP address for August 8th, 2011. It shows the 
IP address, each community and two numbers. The report 
was generated using the view based on the security risk of 
used ports. The first number is simply a count of how many 
members of the current community had an entry in the COI 
of this IP address. The second number shows how often the 
IP address connected to other IP addresses that are in none of 
the communities. We stated in Section II that this number is 
a good indicator of the severity and specificity of an attack. 
Here, it is relatively low, which leads to the assumption that 
the connections were not driven by a brute-force or port-scan- 
like technique. 

To verify this intuition, Table III shows the used ports
for each community member individually. In contrast to the
previous example, the source port is not constant anymore but
seems to be chosen randomly. The destination port, however,
is always 445. Port 445 is usually used by “Win2k+ Server
Message Block”. Note that every connection only appeared 
once in the netflow. This either means there was in fact just 
one connection being used or the attempt to connect failed. 

⁴ http://isc.sans.org


IP Address   Community   # in Top-K   # outside Community
X.Y.Z.W      A           0            42
X.Y.Z.W      B           0            42
X.Y.Z.W      C           1            42
X.Y.Z.W      D           0            42
X.Y.Z.W      E           1            42
X.Y.Z.W      F           6            42

TABLE II: Anonymized report-overview-snippet from August
8th, 2011. The last two columns contain the following numbers:
(1) Number of members of the current community which
had an entry in the COI of the current IP address and (2)
number of connections to non-community members after the
first connection to a community-member.


IP Address   Src Port   Community   Dst Port   Occurrences
X.Y.Z.W      4798       F           445        1
X.Y.Z.W      1238       F           445        1
X.Y.Z.W      1256       F           445        1
X.Y.Z.W      1682       F           445        1
X.Y.Z.W      3143       C,E,F       445        1
X.Y.Z.W      4243       F           445        1

TABLE III: Anonymized report-snippet from August 8th, 2011


In the next step, we again use the whois service to determine
that the IP address belongs to a European ISP. However,
it is not clear if this address belongs to a community member.
An attempt to ping the address did not succeed. A query
to “SANS Internet Storm Center” (Figure 8) shows a long
list of reports about worms using this port with the famous
“Conficker” being one of them.

As with the previous example, we cannot determine if this
case is a true attack. To this end, we would need the help of the
system administrators of the various community members who 
have access to the log-files of the corresponding machines. 
However, there are two interesting points concerning this IP 
address. First, there are only a total of 69 entries in the netflow, 
where this address is the source of communication. Second, all 
connections transfer only a very small amount of data, around
60 bytes each. Even in total, this only sums up to several
kilobytes. Therefore, this address only appears in the ports view
and not in the other views that consider either the number of 
bytes or connections. Hence, an administrator would need to 
set the detection threshold very low to see an alarm concerning 
this address. 


C. Case 3 


In contrast to the previous two cases, this case is not an 
attack. It occurred in all views and if one only looks at the 
report (an excerpt is shown in Table IV), it is not immediately 
clear what service is being used since the address seems to be 
using random ports on both ends of the communication. The 
query to whois also does not reveal any useful information,
except that the address belongs to a US ISP. 

However, looking at the connections with IP addresses 
outside of the communities provides a hint that this is not 
targeted against any of our specified communities as shown in
Table V. Moreover, the use of port 20 (the last line in Table IV)
gives a hint that at least some part of the communication
involved anonymous ftp, which uses port 20 to initiate the
connection but uses random ports thereafter. Finally, using
an ftp-client (i.e., a web-browser) revealed indeed that this
is simply an ftp server hosting software updates. As a result
of this analysis, we added the address to the white list.


IP Address   Src Port   Community   Dst Port   Occurrences
X.Y.Z.W      13397      B           38426      1
X.Y.Z.W      41748      F           41387      1
X.Y.Z.W      49534      C           23068      1
X.Y.Z.W      16249      C           22654      1
X.Y.Z.W      29167      C           43183      2
X.Y.Z.W      20         F           7205       4

TABLE IV: Anonymized report-snippet from August 8th, 2011


IP Address   Community   # in Top-K   # outside Community
X.Y.Z.W      A           0            14250
X.Y.Z.W      B           2            14250
X.Y.Z.W      C           1            14250
X.Y.Z.W      D           0            14250
X.Y.Z.W      E           0            14250
X.Y.Z.W      F           6            14250

TABLE V: Anonymized report-overview-snippet from August
8th, 2011. The last two columns contain the following numbers:
(1) Number of members of the current community which
had an entry in the COI of the current IP address and (2)
number of connections to non-community members after the
first connection to a community member.


Fig. 6: Screenshot of “http://isc.sans.org/port.html?port=1433” from September 8th, 2011 (user comments on Microsoft SQL Server vulnerabilities and worms)

Fig. 7: Screenshot of “http://isc.sans.org/port.html?port=1433” from September 8th, 2011 (CVE links for Microsoft SQL Server)

Fig. 8: Screenshot of “http://isc.sans.org/port.html?port=445” from September 8th, 2011 (user comments on worms using port 445, including Conficker)


D. Building Communities 


In real use of the system, the community members might 
be known a priori and even stay relatively fixed. However, 
in our case we built the community lists incrementally by
identifying new community members based on the COIs generated.
Specifically, we assumed that members of a community
exchange information with one another and often the data 
exchange is encrypted. Therefore, we focused on new IP 
addresses that used the https-port (443) for communication. 
However, a certain minimal set of known members is needed 
before reports can be generated. This set should be as large as 
possible for two reasons. First, the likelihood that an unknown 
address that belongs to the community (and thus, should be 
added) connects to one or more entries of a large set of 
members is higher than if the set contains only very few 
entries. Second, if the set is large, one can set the reporting 
threshold higher and reduce the amount of noise. 

Building a community this way is a task that lasts for 
weeks, depending on how much communication is observed 
between the individual members and how large the community 
is initially. We start by adding the new community to the list of 
communities. With every subsequent report, we scan for new 
members and add them to the corresponding lists. This way, 
the community grows every day, and with it the likelihood 
of finding any missing members. The community stabilizes 
eventually with fewer and fewer new members per day. 


VI. RELATED WORK 


A number of tools and techniques have been developed to 
process and visualize netflow data (see [17] for a survey). Netflow
processing tools include OSU flow-tools [16], SiLK [7],
and Nfdump⁵. In addition to command line tools, numerous
graphical user interfaces exist to visualize and query network
activity, including NTOP⁶, Nfsen [9], NfSight [1], VisFlowConnect
[20], FlowScan [14], NetPY [2], FloVis [18], VIAssist
[5], and NFlowVis [6]. While visualization tools allow the
users to view the netflow data from different perspectives
to locate suspicious activity, our approach analyzes the data
and produces a small number of meaningful alarms each day.
Also, our focus on communities allows us to detect attacks 
and suspicious behavior that is focused on a potentially small 
community, but would not show significantly on a global scale. 

Detection of similar communication behavior in multiple 
hosts has been used previously to raise suspicion that hosts 
with the correlated behavior may be members of the same 
botnet. For example, [21] uses netflow data to identify sets of 
suspicious hosts and then uses host level information (collected 
on each host by a local monitor) to confirm or reject the 
suspicions. However, detection of botnets is simplified by the 
fact that the bots typically act in unison (e.g., start spamming 
or DDoS attack against a target at the same time). Indeed, 
much of the work in this area (e.g., BotMiner [8]) specifically
builds detection mechanisms based on the assumptions of the
communication behavior required for a botnet. Furthermore, 
to our knowledge, prior work is limited to detecting similar 
behavior within one organization. 

⁵ http://nfdump.sourceforge.net
⁶ http://www.ntop.org


The concept of using a community to help detect security
events has been used in the past. For example, the Ensemble
[15] system detects applications that have been hijacked by
using the idea of a trusted community of users contributing 
system-call level local profiles of an application to a com- 
mon merging engine. The merging engine generates a global 
profile that can be used to detect or prevent anomalies in 
application behavior at each end-host in real time. A similar 
concept of collaborative learning for security [13] is applied 
to automatically generate a patch to the problematic software 
without affecting application functionality. PeerPressure [19] 
automatically detects and troubleshoots misconfigurations by 
assuming that most users in the community have the correct 
configuration. Cooperative Bug Isolation [11] leverages the 
community to do statistical debugging based on the feedback 
data automatically generated by community users. Vigilante 
[4] apply the community concept for containment of Internet 
worms by community members running detection engines on 
their machines, where the detection engines distribute attack 
signatures to other community members when a machine is 
infected. 


VII. CONCLUSIONS 


In this paper, we have presented a community-based analy- 
sis and alerting technique for detecting small-footprint attacks 
targeting communities of interest for attackers such as financial 
institutions, e-commerce web sites, or the electricity generation 
and distribution infrastructure. By comparing communication 
behavior across the member organizations in the community, 
it is possible to detect suspect behavior that may fall below 
detection thresholds at individual member organizations. A 
white list can be used to avoid repeating false positives. 
We have implemented the analysis algorithm in a scalable 
distributed architecture that can process large volumes of 
netflow data efficiently. 
Abstract 


This work presents the Web Classifying Immune System 
(WCIS), a prototype system that detects zero-day 
attacks against web servers by examining web server re- 
quests. WCIS is intended to work in conjunction with 
more traditional intrusion detection systems to detect 
new and emerging threats that are not detected by the 
traditional IDS database. WCIS is at its core an artifi- 
cial immune system, but WCIS expands on the concept 
of artificial immune systems by adding a classifier for 
web server requests. This gives the system administra- 
tor more information about the nature of the detected 
threat which is not given by a traditional artificial im- 
mune system. This prototype system also seeks to im- 
prove the efficiency of an artificial immune system by 
employing back-end, batch processing so that WCIS can 
detect threats on higher capacity networks. This work 
shows that WCIS is able to achieve a high rate of ac- 
curacy at detecting and classifying attacks against web 
servers with very few false positives. 


Tags: Research, Security, Web, Artificial Immune Sys- 
tem 


1 Introduction 


Traditional intrusion detection systems (IDS) are very ef- 
ficient at detecting known threats and even some emerg- 
ing variants, but are not as effective at detecting zero-day 
attacks. Artificial immune systems (AIS) are appealing 
for detecting zero-day attacks because they are inspired 
by the adaptive concepts of biological immune systems. 
Biological immune systems are alluring to the computer 
security realm because they can innately adapt to new 
pathogens or variations on previously seen pathogens, 
something which even modern intrusion detection sys- 
tems struggle to do. The primary goal of an artificial 
immune system is to apply these biological principles to 
the problem of distinguishing normal traffic or data from 
abnormal traffic or data, even if the abnormal traffic cor- 
responds to a completely new attack. 

This work presents a variation of the artificial immune 
system concept called Web Classifying Immune System 
(WCIS). WCIS is intended to work in concert with a tra- 
ditional IDS, scanning the traffic that the IDS has labeled 
as normal to see if there is a zero-day attack, or even just 
a new, unknown variant of an existing attack, present in 
the traffic. As the name implies, WCIS focuses on at- 
tacks conveyed in web server requests. While the con- 
cepts can apply to other problem domains, this work fo- 
cuses on web server requests as a “proof of concept”. 

There are limitations to the traditional AIS model that 
WCIS seeks to overcome. Most traditional artificial immune 
systems only provide a binary classification of 
traffic or data as “normal” or “attack”. For many prob- 
lem domains, particularly the problem domain of mali- 
cious web server requests, this simple classification is not 
sufficient. There are a variety of web server attacks rang- 
ing from simple information gathering via HEAD or OP- 
TIONS requests to attacks that attempt to execute code 
on the web server. The administrative response to an at- 
tack will vary based on the type of attack. The prototype 
system presented in this work overcomes this limitation 
by adding classifications to a traditional AIS. 

Since WCIS classifies the attacks as they are detected, 
this provides the web administrator with more informa- 
tion about the nature of the attack than a simple alert 
would provide. For example, an attack which has a di- 
rectory traversal component would require different con- 
figuration changes than a CGI or PHP script with a buffer 
overflow. By providing classifications along with alerts, 
WCIS can help direct the administrative response to a 
zero-day attack more effectively. The administrators 
might not know the name of the attack, but if they know 
it’s a buffer overflow on index.ida, that will allow them 
to focus their response far more than they could with an 
“attack detected” alert provided by traditional artificial 
immune systems. 

Another limitation of traditional artificial immune sys- 
tems is the training of the immune system “antibodies”, 
i.e., the sensors for detecting attacks. The traditional AIS 
model assumes a continual process of evolution occur- 
ring in real-time as it sees and classifies network traffic. 
Most evolutionary algorithms require extensive memory 
and CPU cycles to operate. This leads to two main is- 
sues using AISes when high-volume, real-time detection 
is desired: the sensors take a long time to train, during 
which they are not capable of accurately labeling traffic, 
and sensor refinement after initial training can cause a 
CPU and/or memory bottleneck that limits the volume of 
traffic that the sensors can process. 

WCIS seeks to minimize these issues by separating 
the evolutionary processes from the detection process. 
The evolutionary processes, pre-deployment training and 
sensor refinement, occur “offline” on a back-end system. 
The detection process, monitoring the network traffic, 
occurs “online” in real-time on the network. The “of- 
fline” evolutionary processes produce a set of sensors, 
which essentially detect patterns in the traffic, that are 
deployed to monitor the network traffic in real-time. It 
should be noted that the “online” mode of WCIS is in- 
tended to work in conjunction with a traditional IDS by 
scanning the traffic which the traditional IDS has not 
alerted upon. WCIS, however, does not produce traditional 
IDS rules, as those rules would be unable to gather 
the statistics at the sensor, classification population and 
overall population levels that are needed for sensor re- 
finement. 

In order to maintain one highly desirable feature of an 
AIS, the customization of the sensors for that particular 
network’s traffic, WCIS uses a system profile to train the 
sensors in the pre-deployment phase. These profiles in- 
clude a sampling of normal traffic for the network which 
will be used to train the AIS and a set of labeled attacks 
that will be used to “prime” the classifier. The proto- 
type implementation of WCIS takes Apache logs as the 
source of these two datasets, which makes customization 
of the datasets very easy. One simply has to copy log 
entries over into the appropriate dataset file and rerun the 
pre-deployment phase of WCIS. 

To enable “offline” sensor refinement, the “online” 
WCIS sensors record statistics about their detection and 
classification rates at the individual sensor, classification 
population and overall sensor population levels. This in- 
formation can be sent to a back-end system, which will 
enable WCIS to run the sensor refinement process as a 
batch process on the back-end system while the live sen- 
sors keep detecting. Once the batch process is complete, 
the live sensors can be replaced with the newly refined 
(“next generation”) sensors. The current prototype does 
not yet implement this aspect, as the prototype could not 
be run on live network traffic due to policies and bureaucratic 
limitations on collecting data that may contain 
personal or confidential information at the university. 
However, it is already supported by the internal structure 
of the sensors and merely requires a live network (or iso- 
lated network) test environment to implement and fully 
test this feature. 

In summary, WCIS is a variation of an artificial im- 
mune system that is intended to work in conjunction with 
a traditional intrusion detection system to detect attacks 
that the IDS cannot yet detect. WCIS seeks to overcome 
the usability limitations of traditional artificial immune 
systems by adding a classifier to provide more informa- 
tion about detected attacks. Additionally, WCIS seeks to 
optimize the scalability of the AIS concept by separating 
the evolutionary processes from the detection process. 
This allows the resource intensive aspects of an AIS to 
occur “offline” on a back-end system rather than on the 
detection system. 

Section 2 provides an overview of artificial immune 
systems and the biological principles that inspired them. 
Related work in the area of artificial immune systems and 
classifiers is presented in Section 3. The methodology 
used to add classifications to an artificial immune sys- 
tem is described in Section 4. Section 5 describes how 
WCIS models web server requests. The results of running 
WCIS on sample datasets are presented in Section 
6 and conclusions drawn from these results are given in 
Section 7. Finally, future avenues of research and im- 
provement for WCIS are discussed in Section 8. 


2 Artificial Immune Systems 


An artificial immune system (AIS) is a type of anomaly- 
based intrusion detection system (IDS) inspired by the 
adaptive nature of the biological immune system. A 
biological immune system has to be responsive to new 
and unknown pathogens while also recalling previously 
defeated pathogens to prevent a recurrence of illness. 
While not 100% effective at this task (e.g. auto-immune 
disorders and other immune system malfunctions), the 
biological immune system is more adaptive to new 
pathogens and variants of known pathogens than the 
analogous anomaly-based IDSes. 

Using biological methods to create a better IDS is the 
core concept behind artificial immune systems. The goal 
of an AIS is to distinguish normal traffic (called “self” 
data) from abnormal traffic (called “non-self” data). It 
does so by creating immune “sensors” as analogs to biological 
immune cells. These sensors use pattern matching 
functions to determine if data is “non-self”. Several 
key features of a biological immune system that serve as 
inspiration for an AIS are affinity maturation, negative 
selection and peripheral tolerance. Other features of biological 
immune systems can also be incorporated, but 
these are far more weighty concepts that are beyond the 
scope of this paper. 

Affinity relates to pattern matching. Each immune cell 
(antibody) has a set of proteins on its surface that form 
a three dimensional “lock” pattern which can match the 
“key” pattern of proteins on the surface of a pathogen 
(also called an antigen). Affinity measures how “tightly” 
the lock and key patterns fit together, with a higher affin- 
ity meaning a tighter bond between the antibody and 
pathogen exists. Affinity maturation is the process of re- 
fining an antibody’s lock pattern until it can tightly bind 
to a specific pathogen. This allows the body to “mem- 
orize” specific pathogen patterns, e.g. learn a “signa- 
ture” for that pathogen. This is the basis of immunizations 
in biological immune systems. For AISes, affinity 
maturation allows generic immune system sensors to de- 
velop “signatures” for novel attacks or new variants of 
old attacks. This is accomplished by training the sensors 
against attack data in the pre-deployment phase and by 
refining the sensors during deployment using an evolu- 
tionary technique, such as a genetic algorithm. 

Negative selection is a process for creating new im- 
mune cells that do not react to the body’s own proteins 
(“self”). Most work on artificial immune systems focuses 
on this feature of biological immune systems. The 
immune cells are initially created with a random pattern 
of “lock” proteins. The cells are then tested against a 
random sampling of “self” proteins and structures. If the 
immune cell has too high an affinity for “self”, it is destroyed. 
For AISes, this means the immune sensors are 
initialized with random patterns and each sensor is tested 
against a sample of “normal” data. Those which react too 
strongly to normal data are removed and replaced with a 
new randomly generated and tested sensor. A negative 
selection phase can be used along with affinity matura- 
tion to be sure that the sensors do not start reacting to 
normal data while they are developing an affinity for at- 
tack data. 

Since negative selection uses a random sampling of 
“self” proteins to test new immune cells, there is a pos- 
sibility that cells which are reactive to self will survive 
negative selection. Auto-immune disorders are caused 
by such cells. In an AIS, such sensors would lead to false 
positives, where normal traffic is labeled as an attack. 
The immune system has some protection against this by 
using peripheral tolerance. Peripheral tolerance deacti- 
vates or destroys immune cells that are too reactive to 
self proteins. Not many AISes explore the use of periph- 
eral tolerance in their systems since it is hard to detect 
false positives automatically. One technique might be to 
have a human verify each alert and deactivate any sensor 
which is noted to have an excessive number of false pos- 
itives. In WCIS, the person can also modify the sensor’s 
internal statistics to mark the sensor as “bad”, which will 
prevent the sensor from being used to refine the sensors 
during the next sensor refinement phase. This essentially 
removes the sensor from the “genetic pool” used for sen- 
sor refinement. 


3 Related Work 


The research group of Stephanie Forrest at the University 
of New Mexico has produced several pioneering works 
in the field of artificial immune systems. Forrest, et 
al. [9] focused on distinguishing self from non-self and 
laid the foundations for the negative selection algorithm. 
Somayaji, Hofmeyer and Forrest [16] explored the ap- 
plication of these concepts to computer security. This 
work ultimately resulted in the production of the LY- 
SIS [12, 13] immune system for TCP connections. LY- 
SIS monitored the TCP/IP headers of SYN packets to 
detect abnormal traffic. 

Williams, et al. [22] expanded LYSIS to monitor TCP, 
UDP and ICMP traffic. This system, called CDIS, also 
monitored all packets instead of just TCP SYN pack- 
ets. Each AIS sensor in the system monitored a random 
subset of features from the packet headers. The pattern 
matching function used by CDIS used a mix of binary, 
discrete and real value features. 

Gonzales, Dasgupta and Gomez [10] showed that the 
negative selection algorithm is very sensitive to the type 
of matching function used. Ultimately, one hopes that 
negative selection results in sensors with a wide cover- 
age of the non-self space, as this represents potential at- 
tacks. But [10] showed that the algorithms of Forrest, 
et al. [1, 9, 12, 13, 16] and Farmer, et al. [8] resulted 
in restricted coverage of the non-self space. These algo- 
rithms work best with binary and discrete data. Of the 
algorithms tested, the real value matching function used 
by Gonzales, Dasgupta and Kozma [11] had the best cov- 
erage of the non-self space. 

In Dasgupta, Yu and Majumdar [5], a multilevel im- 
mune learning algorithm was introduced, in part to over- 
come deficiencies in simple negative selection algo- 
rithms. This system used collaborations and interactions 
between various types of sensors, analogous to the vari- 
ous types of immune system cells in a biological immune 
system. By requiring collaborations between sensors to 
label data as non-self, the experiments showed the AIS 
achieved better results than simple negative selection. 

The first version of WCIS that was published [2, 4] 
was built off the work of CDIS and LYSIS to monitor 
web server requests. As with CDIS, the sensors moni- 
tored a random subset of features and the pattern match- 
ing function used a mix of binary, discrete and real value 
features. The system also incorporated basic collabora- 
tion between sensors to reduce the false positive rate. 
Table 1: The classification scheme used for the web server attacks. 

Class      Instances  Description 
info       5          Gathers information about server (read only) 
traversal  37         Directory traversal attempt (read only) 
sql        4          SQL injection attack 
buffer     7          Buffer overflow attack 
script     86         Cause a script to do something malicious (execute) 
xss        40         Cross site scripting 


This first version of WCIS was simply an AIS for web 
server requests and included none of the enhancements 
that this work covers. The concept of adding a classi- 
fier to WCIS was explored in [3], but the classifier used 
in that version was prone to overfitting and poor clas- 
sifications and was deemed unsuitable. Neither previous 
version separated the evolutionary processes from the de- 
tection process. 

Watkins, Timmis and Boggess [17, 18, 19, 20, 21] pro- 
posed an artificial immune recognition system for supervised 
learning and reinforcement learning. The pro- 
posed AIS functioned as a classifier. As with [5], it mod- 
eled a variety of immune cells working in collaboration 
to classify data. It required the features be represented as 
a vector of real value ranges and used vector mathematics 
to calculate affinity and distance between cells. A vari- 
ation on k-nearest neighbors was used to calculate the 
class of unknown data once the cells had been trained. 
While this method worked well on datasets that can be 
modeled as a feature vector, its mathematical approach 
limits its application to other feature sets that cannot be 
easily modeled as a vector. 


4 Methodology for the Classifying AIS 


Previously [2, 3, 4], WCIS was defined as an AIS for 
web server attacks and a rudimentary, but poor, classifier 
was implemented. The scheme for fingerprinting web 
server requests, detailed in Section 5, was developed in 
those works. The classifier developed in [3] was prone to 
overfitting and misclassification. A better classifier was 
developed, which is the focus of this section. The sim- 
ple classification scheme given in Table 1 was preserved 
from [3] however, as the classification scheme was not 
the issue with the previous classifier. 

The classification scheme in Table 1 was developed 
based on several common groups of web server re- 
quest attacks that can be found encoded in URIs. The 
“info” classification covers various information gathering 
attacks that do not alter the server. Likewise, the 
“traversal” category solely covers the attacks which uti- 
lize directory traversal, but do not attempt to execute 
anything on the web server, such as attempting to read 
/etc/passwd. If the traversal tries to execute a program, 
it is instead labeled “script”. The “script” class also cov- 
ers other attempts to maliciously execute a program or 
script on the web server. The “sql” class covers SQL in- 
jection attacks. The “buffer” class covers buffer overflow 
attacks, which may also result in commands being exe- 
cuted. Finally, the “xss” class covers cross site scripting 
attacks. Table 2 lists some examples for each class ex- 
cept buffer overflow attacks as those examples were too 
long to easily fit into the table. 

The classification training occurs during the pre- 
deployment stage where the field of potential sensors is 
trained against a system profile. The system profile con- 
sists of a normal dataset (Apache log entries from non- 
malicious web requests) and an attack dataset (Apache 
log entries from actual attacks on a web server). Each 
attack in the attack dataset was hand inspected and la- 
beled with a classification. One main issue faced while 
developing the attack dataset was obtaining sufficient ex- 
amples of each classification of attack. Attack exam- 
ples were gleaned from Bugtraq [15], live Apache web 
servers and an un-networked machine where selected at- 
tacks were run against a local web server. As seen from 
Table 1, most of the examples fell into the category of 
traversal, script or xss. To prevent the sensors from be- 
coming biased towards those classes, each sensor tracks 
the percentage of the class that it is able to detect rather 
than a raw count. 

To add classification to WCIS, each sensor not only 
tracks the percentage of each category it reacted to dur- 
ing pre-deployment training, it also has a desired cat- 
egory for which it should develop affinity. Previously 
in [3], WCIS did not have this second feature and it 
was discovered that the population of sensors optimized 
for the “script” and “traversal” classes. To prevent this 
from happening, the sensors were divided into groups 
and each group was tasked with optimizing affinity for 
a particular classification. This is a niching algorithm, 
which is intended to develop “specialists” for all classifi- 
cation labels. 

To optimize affinity, the sensors must be trained and 
matured. This is accomplished with a typical artifi- 
cial immune system lifecycle conducted during the pre- 
deployment and sensor refinement phases. The lifecy- 
cle is an iterative process which repeatedly applies the 
affinity maturation steps. This results in a set of trained 
sensors that have higher affinity towards attacks than the 
initial sensors. The steps for the lifecycle are detailed in 
the following subsections. 

The primary difference between the pre-deployment 
and sensor refinement phases is the source of the statistics 
used for training. In the pre-deployment phase, training 
statistics come purely from the sensor’s reaction to 
the system profile datasets. In the sensor refinement 
phase, statistics come from the sensor’s reaction to live 
traffic, with negative selection against the normal dataset 
also conducted to prevent sensors from reacting to normal 
traffic. 


Table 2: A sample of requests in the attack dataset. 

Class      URL 
info       GET x HTTP/1.0 
traversal  GET ../../../boot.ini HTTP/1.0 
traversal  GET %2E%2E/%2E%2E/%2E%2E/%2E%2E/%2E%2E/winnt/win.ini HTTP/1.0 
sql        GET /scripts/test.asp?var=foo*;EXEC master.dbo.xp_cmdshell’cmd.exe’ HTTP/1.0 
script     GET /scripts/..%%35%63../winnt/system32/cmd.exe?/c+cmd.exe HTTP/1.0 
script     GET /ans.pl?p=../../../../bin/command%20argument|&blah HTTP/1.1 
xss        GET /<script>alert('Vulnerable')</script> HTTP/1.1 
xss        GET /javascript:void%20window.open( HTTP/1.0 


Table 3: A sample of requests in the normal dataset. 

GET /OOmaster/hgafgate.gif HTTP/1.0 
GET /Copy%20of%2010.gif HTTP/1.0 
GET /faq/web/viewfaq.php3 HTTP/1.0 
GET /forums/newmsg.php?fid=2&pid=30 HTTP/1.1 
GET /index.html?browsePage=commands.html HTTP/1.1 
GET /index.html?browsePage=kb/item_detail.php&id=19 HTTP/1.1 
GET /index.html?secure=1&PHPSESSID=db80c486ee8cef8090a532b93619cd7a HTTP/1.1 
GET /%7E930www/Images/front_y2k_logo02.jpg HTTP/1.0 
GET /ADTracker.asp?linkid=AHCX030&linktype=Room&RID=8 HTTP/1.0 
GET /CGI-BIN/centralad/getimage.exe/19980714243?GROUP=default_buttons HTTP/1.0 


Before going into the details of the pre-deployment 
phase, some key terminology should be reviewed. The 
sensor population size is the number of unique sensors 
being processed. Each individual sensor within the pop- 
ulation has its own data structure to store its pattern, clas- 
sification label and statistics. Patterns may be repeated in 
multiple individual sensors within the population. This is 
called a loss of diversity, or overfitting, and essentially 
leads to redundancy (i.e., multiple sensors have the same 
“signature”). The sensor lifecycle is the process of cre- 
ating, refining and perhaps destroying individual sensors 
within the population. Throughout the lifecycle, the pop- 
ulation size remains constant. Every destroyed sensor is 
replaced with exactly one sensor. The sensors that exist 
in each iteration through the lifecycle process are called 
a generation of the population. Each new generation is 
generated by the affinity maturation process, which uses 
a genetic algorithm to refine the sensor population as a 
whole. The sensor’s chromosome is a method to repre- 
sent the sensor’s pattern by using data structures that can 
be manipulated by a genetic algorithm. The chromosome 
contains all possible features that a pattern in WCIS may 
use (see Section 5 for a description of the features), the 
current values for each feature and a flag to indicate if 
the sensor is using that feature in its pattern (e.g. if the 
feature is expressed in that particular sensor). The fit- 
ness of a sensor is determined by its statistics and is used 
to gauge its accuracy at detecting attacks in its classifica- 
tion label. The most fit sensors contribute more “genetic 
information” to the next generation than the less fit sen- 
sors. 
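
The terminology above can be pictured as a small data structure. The following is a minimal sketch, not the WCIS implementation: the class and field names are assumptions, and the feature names (`uri_length`, `has_dotdot`, `num_params`) are hypothetical placeholders for the actual features described in Section 5.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    """One feature slot in a chromosome: a value plus an 'expressed' flag."""
    value: float
    expressed: bool = False


@dataclass
class Sensor:
    """A sensor: chromosome, desired class label, and detection statistics."""
    chromosome: dict          # feature name -> Gene; holds every possible feature
    desired_class: str        # class this sensor's group is optimizing for
    detected_pct: dict = field(default_factory=dict)  # class -> fraction detected

    def pattern(self):
        """Return only the expressed features, i.e. the matching pattern."""
        return {name: g.value for name, g in self.chromosome.items() if g.expressed}


# Hypothetical feature names, for illustration only.
sensor = Sensor(
    chromosome={
        "uri_length": Gene(42.0, expressed=True),
        "has_dotdot": Gene(1.0, expressed=True),
        "num_params": Gene(3.0, expressed=False),  # dormant: carried, not matched
    },
    desired_class="traversal",
)
print(sensor.pattern())
```

The dormant `num_params` gene stays in the chromosome even though it is not part of the matching pattern, which is what allows a later generation to express it.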


4.1 Lifecycle 


In pre-deployment training, a normal dataset, samples 
of which can be seen in Table 3, and the labeled attack 
dataset are given as input to the lifecycle function. The 
pre-deployment lifecycle begins by randomly generating 
a population of sensors for each classification group. The 
random generation process selects a subset of features for 
each sensor’s matching pattern and randomly assigns val- 
ues to those features. In the sensor refinement phase, the 
lifecycle function would instead begin with copies of the 
existing sensors and any sensors which have been deacti- 
vated by the system administrator (peripheral tolerance) 
will be discarded and replaced by a random sensor. 

For both phases, the iterative affinity maturation process 
is then entered, which refines the sensors over a series 
of generations. It is important to note that affinity 
maturation occurs within each population for a classification 
label, not across all classification label populations. 
The goal of affinity maturation is to produce sensors 
which specialize in detecting attacks for that particular 
classification label, so each population is kept distinct. 


4.2 Negative Selection Phase 


The affinity maturation process begins with negative se- 
lection. The population of sensors is compared to the 
normal dataset. Any sensor that has too strong of an 
affinity to requests in the normal dataset is discarded and 
replaced with a random sensor. The replacement is like- 
wise tested against the normal dataset and is not allowed 
to replace the discarded sensor until its random feature 
set (e.g. pattern) does not have strong affinity towards 
the normal dataset. The exact level of affinity towards 
the normal dataset that is tolerated in this phase is tun- 
able in WCIS. 
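
The replace-and-retest loop described above can be sketched as follows. This is an illustration under stated assumptions: the `affinity` function is a toy substring matcher standing in for the WCIS matching function (not shown in this section), patterns are simple tuples of tokens, and `tolerance` plays the role of the tunable affinity level.

```python
import random


def affinity(pattern, request):
    """Toy affinity: fraction of pattern tokens found in the request.

    This is a stand-in; the real WCIS matching function is not shown here.
    """
    hits = sum(1 for token in pattern if token in request)
    return hits / len(pattern)


def random_pattern(rng, vocabulary, size=2):
    """Draw a random pattern (a tuple of tokens) from a toy vocabulary."""
    return tuple(rng.sample(vocabulary, size))


def negative_selection(population, normal_requests, rng, vocabulary, tolerance=0.5):
    """Replace every sensor whose affinity to a normal request exceeds tolerance.

    Replacements are themselves tested, so every returned sensor passes and
    the population size is unchanged. The loop assumes the vocabulary can
    produce at least one acceptable pattern.
    """
    def reacts_to_self(pattern):
        return any(affinity(pattern, req) > tolerance for req in normal_requests)

    survivors = []
    for pattern in population:
        while reacts_to_self(pattern):
            pattern = random_pattern(rng, vocabulary)
        survivors.append(pattern)
    return survivors


rng = random.Random(0)
vocabulary = ["../", "cmd.exe", "<script>", ".gif", "index", "EXEC"]
normal = ["GET /index.html HTTP/1.1", "GET /logo.gif HTTP/1.0"]
population = [("index", ".gif"), ("../", "cmd.exe")]
result = negative_selection(population, normal, rng, vocabulary, tolerance=0.4)
```

In this toy run the first sensor reacts to normal requests and is replaced, while the second is kept as is.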


4.3 Training Phase 


After negative selection, the sensors enter two phases of 
training. During the first phase of training, the sensors 
are compared to all of the attack requests and a random 
subset of normal requests. If a sensor has affinity to an 
attack, it records the classification of that attack. At the 
end of the first phase, each sensor will know the percent- 
age of attacks in each category it can detect. It then sees 
which classification it is best at detecting and marks that 
classification as its class. The sensor may mark itself as 
a different classification than what its group is supposed 
to be optimizing for. This simply means the sensor is not 
as good at detecting the desired classification as it is at 
detecting a different classification. 
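
A short sketch of the first training phase shows why per-class percentages matter. The class sizes come from Table 1; the sensor's detection counts are hypothetical, and the function names are assumptions made for illustration.

```python
from collections import Counter


def detection_rates(detected_classes, class_totals):
    """Fraction of each attack class that one sensor detected."""
    counts = Counter(detected_classes)
    return {cls: counts[cls] / total for cls, total in class_totals.items()}


def self_label(rates):
    """Pick the class the sensor detects best (ties: first in dict order)."""
    return max(rates, key=rates.get)


# Class sizes taken from Table 1 for three of the classes.
class_totals = {"info": 5, "traversal": 37, "script": 86}

# Hypothetical sensor: it reacted to 4 info attacks and 20 script attacks.
detected = ["info"] * 4 + ["script"] * 20
rates = detection_rates(detected, class_totals)
print(self_label(rates))  # prints: info  (4/5 = 0.8 beats 20/86, about 0.23)
```

With raw counts this sensor would look like a "script" specialist (20 hits versus 4); with percentages it is labeled "info", the class it actually covers best.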

During the second phase of training, the sensors make 
a second pass over the attack dataset. For each attack, the 
sensors which can detect it vote on the classification of 
the attack. The accuracy of each group of sensors at de- 
tecting its desired classification is recorded. This second 
phase is purely for computing the accuracy statistics and 
does not affect the affinity maturation process. The accu- 
racy of the sensors during experimental testing is given 
in Section 6. 
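
The second pass can be illustrated with a simple vote. The text does not specify the voting rule, so the sketch below assumes a plain plurality vote over the self-assigned labels of the sensors that fired; the data layout and names are hypothetical.

```python
from collections import Counter


def vote(labels):
    """Plurality vote over the class labels of the sensors that fired.

    Returns None when no sensor detected the attack.
    """
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


def class_accuracy(attacks, cls):
    """Fraction of attacks of class `cls` whose vote matched the true label."""
    relevant = [a for a in attacks if a["true"] == cls]
    if not relevant:
        return 0.0
    correct = sum(1 for a in relevant if vote(a["votes"]) == cls)
    return correct / len(relevant)


# Hypothetical second-pass results: true label plus labels of firing sensors.
attacks = [
    {"true": "sql", "votes": ["sql", "sql", "script"]},  # classified correctly
    {"true": "sql", "votes": ["script"]},                # misclassified
    {"true": "sql", "votes": []},                        # missed entirely
    {"true": "xss", "votes": ["xss"]},
]
print(class_accuracy(attacks, "sql"))
```

A missed attack counts against the class accuracy in this sketch, the same as a misclassified one.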


4.4 Genetic Algorithm Phase 


After training, the sensors move on to the genetic al- 
gorithm phase. This phase first “breeds” the sensors to 
create the next generation of sensors and then mutates 
the next generation. The breeding phase uses a single- 
objective genetic algorithm which optimizes for a sin- 
gle fitness metric (multi-objective algorithms allow opti- 
mization for multiple fitness metrics). The fitness of each 
sensor for this phase is its ability to classify the attacks 
in the desired classification for its population. For ex- 
ample, if a “script” sensor can detect 70 of the 86 script 
attacks, it would have a fitness of 0.814 even if it could 
also detect 100% of the “traversal” attacks. A secondary 
fitness value is also computed for each sensor but is not 
directly used by the genetic algorithm. This fitness value 
measures how well the sensor can detect attacks without 
excessive false positives. The secondary fitness ranges in 
value from -2 (all of its alerts are on normal requests in- 
stead of attack requests) to +2 (all of its alerts are attack 
requests). 
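
The primary fitness example above works out as follows. This is a minimal sketch with assumed names; only the primary fitness is shown, since the formula for the secondary fitness is not given in this section.

```python
def primary_fitness(detected, class_totals, desired_class):
    """Fraction of the desired class detected; other classes are ignored."""
    return detected.get(desired_class, 0) / class_totals[desired_class]


class_totals = {"script": 86, "traversal": 37}   # counts from Table 1
detected = {"script": 70, "traversal": 37}       # the sensor from the example

print(round(primary_fitness(detected, class_totals, "script"), 3))  # 0.814
```

Detecting all 37 traversal attacks does not raise the score of a sensor in the "script" population, which keeps each population specialized.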


Rank selection with elitism using the primary fitness 
value is used to select the “parent” sensors. Rank se- 
lection chooses the most fit sensors to be parent sensors. 
Elitism allows a percentage of highly fit parent sensors 
to survive into the next generation. The exact percent- 
age is tunable in WCIS. Once two parent sensors are 
selected, single point crossover on the parents’ chromo- 
somes is used to create the chromosomes for the “chil- 
dren” sensors. The chromosome is the complete feature 
set, a subset of which will be expressed in each parent. 
The expressed feature set for each child sensor is the 
intersection of the expressed feature sets of the parent 
sensors. Additionally, a feature that only one parent ex- 
presses will be randomly expressed in the child. Even if 
the feature is not expressed, the child will still inherit the 
values for that feature from the parent; the feature simply is not 
used by the child to match against requests. This preserves 
the genetic information in a dormant state in case 
future offspring randomly choose to express that feature. 
Finally, if a child exits this expressed feature selection 
phase with fewer than two features expressed, it randomly 
chooses features to add to its expressed feature set until 
the set size is two. 
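
The crossover and expressed-feature rules can be sketched as below. The chromosome layout (a dict of feature name to a value and an expressed flag), the 50% chance of expressing a feature held by only one parent, and the feature names `f1` to `f4` are all assumptions made for illustration.

```python
import random


def crossover(parent_a, parent_b, rng, min_expressed=2):
    """Single-point crossover over two chromosomes.

    A chromosome is a dict: feature name -> (value, expressed flag). Both
    parents must hold the same features in the same order.
    """
    names = list(parent_a)
    point = rng.randrange(1, len(names))       # cut strictly inside the chromosome
    child = {}
    for i, name in enumerate(names):
        value = (parent_a if i < point else parent_b)[name][0]  # inherit the value
        in_a, in_b = parent_a[name][1], parent_b[name][1]
        if in_a and in_b:
            expressed = True                   # shared by both parents: keep
        elif in_a or in_b:
            expressed = rng.random() < 0.5     # one parent only: coin flip
        else:
            expressed = False                  # dormant in both: stays dormant
        child[name] = (value, expressed)

    # Top up to the minimum by expressing randomly chosen dormant features.
    dormant = [n for n in names if not child[n][1]]
    rng.shuffle(dormant)
    while sum(e for _, e in child.values()) < min_expressed and dormant:
        name = dormant.pop()
        child[name] = (child[name][0], True)
    return child


parent_a = {"f1": (1.0, True), "f2": (2.0, False), "f3": (3.0, True), "f4": (4.0, False)}
parent_b = {"f1": (10.0, True), "f2": (20.0, True), "f3": (30.0, False), "f4": (40.0, False)}
child = crossover(parent_a, parent_b, random.Random(7))
```

Here `f1` is always expressed in the child because both parents express it, while `f4` is inherited as a dormant value unless the top-up step selects it.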


Besides the children sensors created by crossover, randomly 
selected parent sensors are also chosen as survivors 
during the elitism process. The population for 
the next generation of the affinity maturation process is 
the combination of the children and the survivors. Addi- 
tionally, to prevent overfitting, breeding ceases when the 
population for a specific class achieves 100% accuracy 
at detecting that class. In that case, the next generation 
consists entirely of survivors. 
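
One way to picture how a generation is assembled is the sketch below. It is a simplified illustration under stated assumptions, not the WCIS code: the parent pool is assumed to be the fitter half of the ranked population, xover is read as the fraction of the next generation produced by breeding, and the names next_generation, fitness and breed are hypothetical.

```python
import random

def next_generation(population, fitness, breed, xover=0.6, accuracy=0.0, rng=random):
    """Build the next generation from ranked parents.

    population: list of at least two sensors.
    fitness: sensor -> primary fitness value.
    breed: (parent_a, parent_b) -> child sensor.
    """
    # Breeding ceases once the class is detected with 100% accuracy;
    # the next generation then consists entirely of survivors.
    if accuracy >= 1.0:
        return list(population)
    size = len(population)
    ranked = sorted(population, key=fitness, reverse=True)
    # Assumption: the fitter half of the ranking forms the parent pool.
    parents = ranked[:max(2, size // 2)]
    n_children = round(size * xover)
    n_survivors = size - n_children
    children = []
    while len(children) < n_children:
        parent_a, parent_b = rng.sample(parents, 2)
        children.append(breed(parent_a, parent_b))
    # Survivors are drawn at random from the parent pool (elitism).
    survivors = rng.sample(parents, min(n_survivors, len(parents)))
    return children + survivors
```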


After breeding is completed, mutation is performed on 
the next generation. A subset of sensors is selected ran- 
domly from the population. A random expressed feature 
in the sensor’s chromosome is selected for mutation. If 
the feature is binary or discrete, a bit is flipped. If the 
feature is a real value, the value is altered by a random 
number. 
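
For binary and discrete features, the mutation step can be sketched as below. The helper names are hypothetical, and treating the mutation rate as the fraction of sensors selected per generation is an assumption of this sketch.

```python
import random

def flip_random_bit(bitmap, width, rng=random):
    """Flip one randomly chosen bit of a bitmap that is `width` bits wide."""
    return bitmap ^ (1 << rng.randrange(width))

def pick_for_mutation(population, mutation_rate, rng=random):
    """Randomly select the subset of sensors to mutate.

    Assumption: mutation_rate is the fraction of the population selected.
    """
    count = round(len(population) * mutation_rate)
    return rng.sample(population, count)
```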
4.5 Sensor Deployment and Refinement


The lifecycle continues by iterating through the nega- 
tive selection, training and genetic algorithm phases un- 
til a maximum number of generations is reached. At this 
point, the sensors are considered trained (or refined), al- 
though they may not have perfect accuracy for their clas- 
sification. In the pre-deployment phase, the sensors with 
a secondary fitness greater than 0.5 will become the live 
sensors. In the sensor refinement phase, those sensors 
would replace the existing live sensors, as the “next gen- 
eration” of sensors. The threshold of secondary fitness 
may be refined to trade off between covering potential 
attacks and generating too many false positives. 


Live deployment of the sensors could not be tested due 
to bureaucratic issues obtaining the appropriate autho- 
rization for live monitoring of the department network. 
Since live network traffic could contain personally iden- 
tifying or confidential information, the campus requires 
assurance that WCIS will protect such information from 
unauthorized view before granting authorization. As of 
this time, the authorization is still pending. 


Since this bureaucratic restriction prevented the live 
deployment of sensors to test the concept, the sensors 
are instead presented with unlabeled data to see how they 
perform in a real-world scenario. Any sensor with a sec- 
ondary fitness less than the above threshold is not used 
for this phase as it has difficulty distinguishing normal 
requests from attack requests. The sensors determine if 
each unlabeled request is an attack or a normal request. 
If the request is labeled an attack by a sensor, the classi- 
fication of the sensor is recorded. After passing the unla- 
beled request past all sensors, the classification with the 
highest “vote” count is chosen as the class label for the 
request. Those results are then hand-verified to see their 
accuracy. The results of testing the sensors against un- 
known data are given in Section 6. 
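
The voting step can be illustrated with a short sketch. The representation of a sensor as a (label, secondary fitness, match function) tuple is an assumption made for illustration, as is the tie-breaking behavior, which the text does not specify.

```python
from collections import Counter

def classify_request(request, sensors, min_secondary_fitness=0.5):
    """Label one unlabeled request by majority vote.

    sensors: iterable of (label, secondary_fitness, matches) tuples, where
    matches is a callable taking the request and returning True on an alert.
    Returns (winning label or None, vote counts).
    """
    votes = Counter()
    for label, secondary_fitness, matches in sensors:
        # Sensors below the secondary fitness threshold are not used.
        if secondary_fitness < min_secondary_fitness:
            continue
        if matches(request):
            votes[label] += 1
    if not votes:
        return None, votes  # no sensor alerted: treated as a normal request
    # Ties go to the label that received its first vote earliest (assumption).
    label, _ = votes.most_common(1)[0]
    return label, votes
```

A request that draws two "traversal" votes and one "script" vote would be labeled "traversal", while the vote counts remain available for inspection.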


The bureaucratic restriction also made it difficult to 
fully test the scalability of the pre-deployment, detec- 
tion and sensor refinement phases. In particular, this 
made it difficult to fully implement the back-end pro- 
cessing aspects of the sensor refinement phase, as there 
were no deployment and back-end systems to commu- 
nicate between. While WCIS contains the algorithmic 
components of sensor refinement, the practical aspects 
of deploying sensors, recording statistics, communicat- 
ing those statistics back to the back-end system, refin- 
ing sensors on the back-end system and re-deploying the 
next generation of sensors could not be fully investigated. 

The department is currently in the process of building
an isolated network. The sensors can be deployed on the
isolated network since the data will be simulated, which
means campus authorization is not required. This will
allow testing of the sensor refinement phase. Scalability
testing can also be conducted. Based on the promising
results presented in Section 6, it is expected that WCIS
will perform well in a simulated live environment. While
this is still not an ideal scenario, it will allow continued
development and testing of WCIS while the attempts to
get campus authorization for live deployment continue.


Table 4: The special characters used in the fingerprinting
method.

Character   Description
%           Used by various encoding methods such as hex encoding
'           Used by SQL injection attacks
+           Interpreted by Microsoft IIS as a space
..          Used in directory traversal attacks
\           Used in directory traversal attacks since URIs contain only /
(           Used in cross site scripting attacks
)           Used in cross site scripting attacks
<           Used in cross site scripting attacks
>           Used in cross site scripting attacks
//          Used in proxy attempts or to exploit an old Apache vulnerability


5 Fingerprinting URIs 


In order to adapt the AIS method to detect malicious web 
server requests in WCIS, the web request data must be 
converted into a pattern consisting of binary, discrete and 
real value features. The chromosome in each sensor
would then seek to match these features. The features
from the web request chosen for WCIS are the Uniform
Resource Identifier (URI), the HTTP command (GET,
POST, HEAD, etc.) and the HTTP version. Additional
features from the request, such as headers, referrer, and 
so on, could also be added as features, although they are 
not supported at this time in WCIS due to the nature of 
the Apache logs available for data processing. Due to 
the restrictions imposed by the campus, WCIS has had 
to run off of Apache logs rather than the live network 
and the logs are not always configured to log these fea- 
tures. Additionally, WCIS does not look at the IP address 
of the client or the return code as it is not concerned with 
detecting the activities of unique clients or whether an 
attack failed or succeeded. It is concerned with discov- 
ering patterns that indicate a zero-day attack has been 
attempted. 

The HTTP command is converted into a discrete 
bitmap where each set bit refers to a specific command. 
For example, bit 0 is set for GET, bit 1 is set for POST 
and so on. The HTTP protocol is likewise converted into 
a discrete bitmap, although it could also be modeled as a
real value. The length of the URI is converted into a real 
value feature. Likewise, the number of variables in the 
URI is also converted into a real value feature. The URI 
is then parsed to develop a fingerprint of special char- 
acters used in the URI. Table 4 summarizes the special 
characters modeled in the fingerprint. These characters 
were chosen based on the whitepapers published online 
at cgisecurity.com [23, 24] and based on the inspection of 
later web server attacks. Each special character or char- 
acter sequence listed in Table 4 is modeled as a real value 
feature. 
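
A fingerprint along these lines could be computed as in the sketch below. Several details are assumptions rather than the WCIS implementation: each special character feature is taken to be an occurrence count, variables are counted as "&"-separated pairs in the query string, bit assignments beyond GET and POST are invented, and the single quote, ".." and ">" tokens are this sketch's reading of Table 4.

```python
# Token list follows Table 4; "'", ".." and ">" are assumed readings.
SPECIAL_TOKENS = ["%", "'", "+", "..", "\\", "(", ")", "<", ">", "//"]
COMMAND_BITS = {"GET": 0, "POST": 1, "HEAD": 2}   # HEAD's bit is an assumption
VERSION_BITS = {"HTTP/1.0": 0, "HTTP/1.1": 1}      # assumed bit assignment

def fingerprint(request_line):
    """Convert 'METHOD URI VERSION' into a dict of features."""
    method, rest = request_line.split(" ", 1)
    uri, version = rest.rsplit(" ", 1)
    query = uri.partition("?")[2]
    features = {
        "command": 1 << COMMAND_BITS[method],
        "version": 1 << VERSION_BITS[version],
        "uri_length": len(uri),
        "num_vars": len([pair for pair in query.split("&") if pair]),
    }
    for token in SPECIAL_TOKENS:
        # str.count counts non-overlapping occurrences of the token.
        features["count:" + token] = uri.count(token)
    return features
```

For the request line "GET /cs150/index.php?p=../../ HTTP/1.1", this sketch reports one variable and two occurrences of "..".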

Real value features are all modeled as a pair of values: 
[base, offset]. The sensor will match a URI if the URI 
value is within the range of base to base+offset. When 
mutating a real value feature, a random value may be 
added or subtracted from the base, the offset or both. The 
base can only be altered by a value from -2 to +2. The 
offset can only be altered by a value of -4 to +4. This 
prevents mutation from wildly changing the range that a 
feature detects. 
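
The bounded mutation of a real value feature might look like the following sketch. The choice among base, offset or both is made uniformly here, the deltas are drawn as real numbers, and negative results are clamped to zero; all three choices are assumptions, as the text gives only the bounds.

```python
import random

def mutate_real_feature(base, offset, rng=random):
    """Return a mutated (base, offset) pair for a real value feature."""
    target = rng.choice(("base", "offset", "both"))
    if target in ("base", "both"):
        base += rng.uniform(-2, 2)     # base changes by at most 2
    if target in ("offset", "both"):
        offset += rng.uniform(-4, 4)   # offset changes by at most 4
    # Assumption: clamp at zero so the range stays well-formed.
    return max(base, 0.0), max(offset, 0.0)
```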

A sensor is considered to match a web request when all
of its expressed features match the features in the web
request. For binary features, the feature matches when
the bit corresponding to the feature is set in the sensor.
For real value features, a feature matches when its value 
falls within the range of values in the sensor. Note that 
the web request may contain additional features that the 
sensor does not check. The matching is driven by the 
feature set that the sensor expresses, not the feature set 
in the request. 
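
The matching rule can be sketched as below. The dictionary layout of a sensor and the inclusive range bounds are assumptions made for illustration.

```python
def sensor_matches(sensor, request_features):
    """Return True when every expressed feature of the sensor matches.

    sensor: dict with 'expressed' (set of feature names), 'bitmaps'
    (name -> int mask) and 'ranges' (name -> (base, offset)).
    request_features: dict of feature name -> value for one request.
    """
    # Matching is driven by the sensor's expressed features only;
    # anything else in the request is ignored.
    for name in sensor["expressed"]:
        value = request_features[name]
        if name in sensor["bitmaps"]:
            # Bitmap feature: the request's bit must be set in the sensor.
            if not (sensor["bitmaps"][name] & value):
                return False
        else:
            # Real value feature: value must lie in [base, base + offset].
            base, offset = sensor["ranges"][name]
            if not (base <= value <= base + offset):
                return False
    return True
```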


6 Experimental Results 


WCIS was tested using an attack dataset, a normal 
dataset and an unknown dataset, as described in Section 
4. The attack dataset consists of 179 labeled attacks gath- 
ered from Bugtraq, live web server logs and tests run on 
an un-networked machine. The normal dataset consists 
of 52977 regular requests gathered from the Lincoln Lab- 
oratory DARPA dataset [14] and live web server logs. 

Obviously, the preferred method of testing WCIS 
would have been actual live requests to a web server, as 
this would best approximate the real-world perfor-
mance of WCIS. Unfortunately, as described in previous
sections, the regulations at this university have made it 
difficult to do such testing on live web servers due to 
privacy concerns. Instead, the Apache access.log 
repository for the Computer Science department web 
server was used for the unknown dataset. A total of 11659
random requests were pulled from the logs and placed into
the unknown dataset.

Besides the datasets, WCIS has many parameters that 
tune its performance. These parameters are: 
pop The population size for each classification cate-
gory. A larger population size creates a larger pool
of initial random sensors and thus a greater likeli- 
hood of randomly creating a “good” sensor. 


gen The maximum number of generations for the affin- 
ity maturation process. The higher this value is, the 
more likely it is that affinity maturation can derive 
“good” sensors even if the random initial sensors 
are only mediocre. 


xover The percentage of the next generation that 
comes from breeding. The remaining percentage of 
the next generation will be survivors. 


mut The mutation rate for the next generation. A higher 
value introduces more random change in each gen- 
eration, which can be beneficial, harmful or benign. 


thresh The threshold for affinity when doing negative 
selection. Sensors with affinity above this threshold 
are destroyed. 


agree The number of sensors that must agree a request
is an attack before it is labeled as an attack. For
classification, 2 * agree sensors must label an unknown
request as an attack before it will be classified.


WCIS was tested with population sizes of 25, 50 and 
75 for each classification category. Each sensor in a pop- 
ulation is analogous to a rule in an IDS in that it looks for 
a specific pattern in the web request. Note that the actual
total number of sensors tested in each run of WCIS
was pop × number_of_classifications. Refer-
ences to “population size” in this section refer to
the number of sensors for each classification cate-
gory (pop), not the total number of sensors tested
(pop × number_of_classifications).

The maximum numbers of generations tested were 10,
20, 30, 40 and 50. The mutation rates tested were 1%,
2.5%, 5% and 10%. The value for xover was 0.6,
the value for thresh was 0.0002 and the value for
agree was 3, as prior testing has shown these values
yield good results.
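
To make the size of the test matrix concrete, the tested combinations can be enumerated as in the sketch below. The variable and function names are illustrative only; the count of 6 classification labels is taken from the Figure 1 caption and the 5 populations per combination from Section 6.2.

```python
from itertools import product

POP_SIZES = (25, 50, 75)
MAX_GENERATIONS = (10, 20, 30, 40, 50)
MUTATION_RATES = (0.01, 0.025, 0.05, 0.10)
FIXED = {"xover": 0.6, "thresh": 0.0002, "agree": 3}
NUM_CLASSES = 6       # classification labels, per the Figure 1 caption
RUNS_PER_COMBO = 5    # separate populations per combination (Section 6.2)

def parameter_grid():
    """Yield one parameter dict per tested combination."""
    for pop, gen, mut in product(POP_SIZES, MAX_GENERATIONS, MUTATION_RATES):
        yield {"pop": pop, "gen": gen, "mut": mut, **FIXED}

def total_sensors(pop, num_classes=NUM_CLASSES):
    """Total sensors in one run: pop sensors for each classification."""
    return pop * num_classes

grid = list(parameter_grid())
```

Under these assumptions the grid holds 3 × 5 × 4 = 60 combinations, and a run with pop = 75 involves 75 × 6 = 450 sensors in total.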


6.1 Runtime 


One of the first concerns with any method that uses evo- 
lutionary computation, such as genetic algorithms, is 
how long it takes the algorithm to complete. This is 
one of the motivations behind separating the operation of 
WCIS into phases: pre-deployment, detection and sensor 
refinement. Only the pre-deployment and sensor refine- 
ment phases will need to run the genetic algorithm. 

Figure 1: Average runtime for the pre-deployment phase
of WCIS for the three tested population sizes for each
classification label. Note that the actual total number of
sensors is pop * 6 since there were 6 classification la-
bels tested. This is purely the pre-deployment phase run-
time, not the detection or sensor refinement phase run-
time. The detection phase runtime was 0.23 to 0.61 sec-
onds.


To test the runtime for the pre-deployment phase,
WCIS was tested on a Xeon E5410 2.33GHz machine
with 4GB of RAM. The pre-deployment phase was
coded as a single-threaded process. The population for 
each classification label was processed in series, using
round-robin scheduling (e.g. it processed the first gen- 
eration of the info class, then the first generation of the 
traversal class, and so on). With sufficient memory to 
hold multiple copies of the normal and attack datasets, 
WCIS could easily be changed to a multi-threaded pro- 
gram with a thread for each population, which would 
lead to a substantial decrease in the runtime and increase 
in scalability. The current prototype was also coded in 
C++, which could be changed to a more efficient pro- 
gramming language in future versions to provide addi- 
tional scalability. 


These changes were not made because the point of this 
test was to run the genetic algorithm under less than ideal 
conditions to illuminate choke-points in the underlying 
algorithms. These choke-points might not be apparent if 
the code runs too quickly for any differences between in- 
put parameters to become significant. This might leave 
inefficient areas in the underlying algorithms that could 
affect future scalability. Additionally, if the runtime for 
WCIS is reasonable under these less than ideal, and eas- 
ily remedied, coding conditions, then we can be reason- 
ably assured that there are not choke-points in the under- 
lying algorithms. 

As shown in Figure 1, even with the largest popula- 
tion sizes and number of generations, WCIS trained the 
sensors in the pre-deployment phase in under six min-
utes. This is very reasonable for an evolutionary algo-
rithm, so it is unlikely that there are hidden scalability
issues in the underlying algorithms. Converting WCIS 
to a multi-threaded program in a more efficient program- 
ming language should yield even faster results. The sen- 
sor refinement phase is expected to have a similar run- 
time as it needs to run through a similar lifecycle. These 
results also emphasize why it is important to separate off 
the evolutionary phases as back-end processes on a sep- 
arate system from the deployment system. It would be 
unacceptable to wait 6 minutes for the sensors to refine 
themselves on a live system, but the separation allows 
the deployed sensors to continue monitoring live traffic 
while the back-end system refines the sensors. 

While it was not possible at this time to test the detec- 
tion phase with live data due to the previously described 
issues, presenting WCIS with the 11659 unknown re- 
quests to emulate the detection phase took from 0.23 to 
0.61 additional seconds on average, including the extra 
I/O time to load the unknown dataset from disk, log clas- 
sifications and log the classification statistics that are pre- 
sented in the remaining results. There seemed to be lit- 
tle correlation between population size and the additional 
time required for WCIS to test the unknown requests. For 
example, the population size of 50 had the lowest aver- 
age time, while the population size of 25 had the highest 
average time. This suggests most of the variance in the time
to test the unknown dataset was due to I/O latency, par- 
ticularly since the test system had only a consumer-grade 
SATA drive. 

More testing will need to be done to determine the re- 
alistic traffic rates that WCIS can handle during the de- 
tection phase. These can be conducted once the depart- 
ment’s isolated network is completed. 


6.2 Accuracy at Classification 


Since the primary fitness function was the accuracy at 
classifying the attack dataset, let us look at the best ac- 
curacy for each population in the test runs. Five separate 
populations for each classification label were tested for 
each possible combination of variables. The best per- 
forming population for each classification and combina- 
tion was examined. 

The best performing populations when the population 
size was 25 had a maximum number of generations of 40 
and a mutation rate of 1%, as shown in Figure 2. The 
small population size means that WCIS starts with less 
random diversity. This means the affinity of the initial 
sensors might be quite low for their desired classifica- 
tion, whereas with a larger population there is a higher 
chance of randomly generating an antibody with moder- 
ate to strong affinity for the class. Because of this low 
affinity in early generations, the small population size 
needs more generations for affinity maturation. In par-
ticular, Figure 2 shows that the “script” and “traversal”
classes took the most generations to plateau
in accuracy. However, increasing the maximum number of
generations to 50 actually led to overfitting, where the fitness
started to decrease in the final generations. This popu-
lation size also needed the lowest mutation rate of the
best tested population sizes. While mutation can help
increase the likelihood that the appropriate feature(s) for
that classification are affected in a beneficial way, there is
also the possibility that mutation might negatively affect
the accuracy. A small population is less able to recover
from a negative mutation than a large population.


Figure 2: Detection accuracy for each class when the
population size for each classification is 25, the maxi-
mum generations is 40 and the mutation rate is 1%. This
was the best performing population when the population
size for each classification was 25.


The best performing populations when the population 
size was 50 had a maximum number of generations of 
10 and a mutation rate of 2.5%, as shown in Figure 3. 
Since WCIS starts off with a larger random population, 
it is better able to withstand negative mutations and a 
higher mutation rate can also increase the likelihood of 
beneficial mutations. This population size also does not 
need as many generations to achieve good accuracy at 
classification since it starts with a larger random pool of 
sensors and there is a greater likelihood of a good sensor 
being randomly generated in the initial generation. As 
with a population size of 25, too many generations led to 
overfitting and a decrease in accuracy, as shown in Figure 
4 where the maximum number of generations is 30. 

The best performing populations when the population 
size was 75 had a maximum number of generations of 20 
and a mutation rate of 5%, as shown in Figure 5. While 
most of the classification accuracies plateaued in early 
generations, the slightly higher rate of mutation allowed 
for the “info” and “traversal” classifications to randomly 
find the right combination of features to increase accu-
racy in later generations. As noted with the other pop-
ulation sizes, a higher number of maximum generations
led to overfitting.


Figure 3: Detection accuracy for each class when the
population size for each classification is 50, the maxi-
mum generations is 10 and the mutation rate is 2.5%.
This was the best performing population when the popu-
lation size for each classification was 50.

Several trends were noticed across all combinations 
of variables tested. First, regardless of population size, 
maximum generations and mutation rates, the popula- 
tions had great difficulty correctly identifying the “info” 
class of attacks, as shown in Figures 2 through 4. This is 
not surprising as the “info” class is the hardest to distin- 
guish from normal data. Information gathering attacks 
are also hard to distinguish from innocent mistakes, such 
as a typo in the URI. 

Second, as noted above, overfitting and loss of accu-
racy are seen in all tested combinations of variables when
the number of generations is high. This is a general
problem in single-objective, single-crossover genetic al-
gorithms. This is caused by a loss of diversity within the
population. In essence, the sensors become too special-
ized for specific attack instances and lose the ability to
detect more generalized attacks or attacks which lie on
the periphery of the non-self space. It may be the case
that another genetic algorithm would be better suited to 
this problem domain. For example, a multi-objective ge- 
netic algorithm, such as NSGA-II [6, 7], is designed to 
maintain diversity by balancing multiple fitness objec- 
tives. 

Overall, however, the classification scheme employed
by WCIS achieves a high rate of accuracy, particularly
in the classifications with a large set of attack instances
such as “traversal”, “script” and “xss”. While no popu-
lation was able to obtain 100% accuracy in those cate-
gories, this may be due to the diversity issue. Even so,
the accuracy for “traversal” was 81% in many popula-
tions, the accuracy for “script” was 92% in many popu-
lations and the accuracy for “xss” was 92% to 97% in many
populations.


Figure 4: Detection accuracy for each class when the
population size for each classification is 50, the maxi-
mum generations is 30 and the mutation rate is 2.5%.
Note that the extra generations do not yield better results
than Figure 3. In fact, overfitting occurs within several
classification populations.


6.3 Labeling Unknown Data 


After inspecting the accuracy rates, next let us look at 
how well WCIS could label unknown data gleaned from 
Apache access logs. The access logs for the Computer 
Science web server are rotated on a monthly basis, with 
data going back for several years. Random entries were 
selected out of two months of access logs. This created 
an unknown dataset with 11659 entries in it. 

After each population finished affinity maturation, it 
was presented this dataset to label. This emulated a live 
scan of web traffic. While this test was sufficient to eval- 
uate the effectiveness of WCIS at detecting zero-day at- 
tacks, it does not provide metrics for the scalability of 
WCIS. That would require live testing on networks with 
various traffic capacities. Unfortunately, due to the previ- 
ously described challenges with conducting this research 
in our campus environment, that was not possible at this
time. So this test purely focuses on gauging WCIS’s abil- 
ity to detect zero-day attacks and attack variants and its 
false positive rate when given a large dataset of unlabeled 
web requests. 

It quickly became apparent when looking at the alerts
that WCIS raised that someone had tried to attack the
web server repeatedly during the time frame covered by
the Apache logs. Table 5 shows a subset of the attacks
detected by the best population of size 25. Table 6 shows
a subset of the attacks detected by the best population of
size 50. Finally, Table 7 shows a subset of the attacks
detected by the best performing population of size 75.


Figure 5: Detection accuracy for each class when the
population size for each classification is 75, the maxi-
mum generations is 20 and the mutation rate is 5%. This
combination of variables had the best classification accu-
racies of all tested parameters.

As seen in the sample of detected attacks, someone at-
tempted to access the /proc/self/environ file, which con-
tains a list of environment variables, using various di-
rectory traversal attempts. This particular attack is actu-
ally associated with getting a shell on poorly configured
web servers using a combination of directory traversals 
and shellcode or shell commands injected via the User- 
Agent field. Even though WCIS does not include the 
User-Agent field in its feature set, and thus didn’t see the 
actual shellcode that was attempted, it still detected this 
attack as it appeared in the unknown dataset. 

WCIS did have difficulty deciding whether this attack 
was a “traversal” or a “script” attack since this attack 
uses directory traversals to access /proc/self/environ to 
execute code. The attack dataset does classify such at- 
tacks as “script” even though they contain features of a 
“traversal”. As pointed out in Table 1, only the attacks 
which were read-only (such as retrieving the password 
file) were labeled as “traversal” in the attack dataset. If 
the directory traversal resulted in an attempt to execute 
code, it was labeled as a “script” in the attack dataset. 
Thus, it is not unexpected to see that WCIS has difficulty 
determining if these attacks were a “script” or a “traver- 
sal” since its feature set does not, as of now, include the 
portion of the attack (the User-Agent field) that would 
have made it clear it was a “script” attack. 

Additionally, looking at the voting data, the attacks
labeled as “traversal” also had votes for “script”, with
many cases having only a difference of one or two votes
between the two labels. So the “script” population was
detecting these attacks, just not quite as vigorously as the
“traversal” population.


Table 5: A sample of unknown requests detected as attacks in the unknown dataset for the population in Figure 2.

Class      URL
script     GET /* php?option=com_dump&controller=..//..//..//..//..//..//..//..///proc/self/environ%0000 HTTP/1.1
traversal  GET /.php?index=../../../.././0/././././././../../../proc/self/environ%00 HTTP/1.1
traversal  GET /courses/Is290//index.php?p=../../../../././0////./././../..//proc/self/environ%0000 HTTP/1.1


Table 6: A sample of unknown requests detected as attacks in the unknown dataset for the population in Figure 3.

Class      URL
script     GET [...]//proc/self/environ%0000 HTTP/1.1
script     GET /cs150/index.php?p=../../ HTTP/1.1
traversal  GET [...] HTTP/1.1

Being able to detect attacks is desirable, but one also 
wants an IDS to have a low rate of false alarms. WCIS 
did not falsely alarm on any of the normal requests in the 
unknown dataset. This may be due to the fact that some 
of the requests in the normal dataset were also gleaned 
from the department Apache access logs. However, this 
is a good result since it shows that WCIS is easily tuned 
to the normal traffic for a specific website by using a sam- 
pling of that normal traffic to generate the normal dataset. 


7 Conclusions 


This paper presented a method of detecting zero-day at- 
tacks on web servers via malicious requests that is based 
on artificial immune systems. This prototype system, 
called Web Classifying Immune System (WCIS), is in- 
tended to augment the capabilities of an existing intru- 
sion detection system (IDS) by detecting attacks that are 
not detectable by the existing IDS. WCIS is a modified 
artificial immune system (AIS) that adds classification. 
WCIS also seeks to improve the efficiency of an AIS by 
separating tasks into the pre-deployment phase, detection 
phase and sensor refinement phase instead of requiring 
all these tasks to take place within a single AIS lifecycle. 
This allows the detection phase to focus on low-resource, 
speedy sensors while the more costly evolutionary com- 
putation associated with the other phases occurs on a sep- 
arate back-end system. 

Notably, WCIS is able to achieve a high rate of ac- 
curacy at detecting most classes of attacks in the at- 
tack dataset, with the exception of the “info” attacks, 
which are difficult to distinguish from normal requests. 
When tested against unlabeled data from Apache access 
logs, WCIS is able to identify attacks within the requests 
without falsely alerting on normal traffic. WCIS does 
have some difficulty choosing between the “traversal”
and “script” classifications when the “script” attack uses
some elements of directory traversal in its URI. This is 
likely due to the fact that WCIS only models the HTTP 
method, URI and HTTP protocol. However, even with 
this limitation, WCIS is able to detect that an attack con- 
taining elements of a directory traversal has occurred. 

In summary, WCIS is able to achieve a high rate of 
accuracy at detecting and classifying attacks against web 
servers without falsely alarming on normal traffic when 
properly trained on the normal traffic patterns of the net- 
work. WCIS can be easily trained on the normal traf- 
fic patterns by giving it a sampling of web server logs, 
such as Apache logs. The ability to classify the attacks 
is particularly noteworthy as it allows an administrator to 
rapidly focus on the initial mitigation and response tech- 
niques. It might also lead to integration with an auto- 
mated response engine, although that has not yet been 
explored for WCIS. 


8 Future Work


The next phase of development for WCIS will focus on 
creating an appropriate test bed. The department has re- 
cently secured a Department of Education grant that is 
funding the expansion of research laboratory space. A 
portion of this grant is being used to develop an isolated 
network. This can be used to test WCIS (and other se- 
curity tools) without concern about running afoul of the 
campus privacy regulations. This is not a perfect solu- 
tion, as it will still be a simulated environment instead of 
a live environment, but it will permit the full testing of 
the sensor refinement phase, which has been hampered 
by the campus regulations. This will also allow scala- 
bility testing, although the isolated network funding cur- 
rently limits the test bed to Gigabit Ethernet instead of 
10 Gigabit Ethernet, so there will be limitations to test- 
ing the scalability to high capacity networks. 
Table 7: A sample of unknown requests detected as attacks in the unknown dataset for the population in Figure 5.

Class      URL
script     GET /.../ports_labeled.jpg HTTP/1.1
traversal  GET [...] HTTP/1.1
traversal  GET [...]./../../../../proc/self/environ%00 HTTP/1.1
script     GET /faculty/interests/..\\index.html HTTP/1.1


Another area of future development is expanding the 
feature set used by WCIS sensors. Currently, the feature 
set of WCIS only models the request line from the re- 
quest, consisting of the HTTP method, URI and HTTP 
version. It does not model the general headers, request 
headers, entity headers or message body specified by Hy- 
pertext Transfer Protocol version 1.1 for the HTTP re- 
quest. This limitation arose because WCIS had to be run 
on Apache access logs, instead of live data, due to pol- 
icy restrictions at the institution. The available Apache 
logs did not consistently record any header fields. How- 
ever, attackers are using the header fields as a part of their 
attacks so WCIS should expand its feature set to model 
these aspects of malicious web server requests. The iso- 
lated network test bed should enable the incorporation of 
these fields into the sensor feature set, since WCIS will 
no longer be constrained by the formatting of the Apache 
logs. 

Additionally, as noted in Section 6, the genetic algo- 
rithm currently being used by WCIS may not be the best 
algorithm for this problem domain. It suffers from a loss 
of diversity, which leads to overfitting and a decreased 
accuracy at detecting and classifying attacks as the gen- 
erations progress. Another avenue of future research is 
to explore how other genetic algorithms such as multi- 
objective genetic algorithms can improve diversity in the 
sensor population. This diversity will also be useful in 
detecting novel attacks that do not clearly fall under one 
of the existing classification categories. 
Abstract


Network providers are challenged by new requirements 
for fast and error-free service turn-up. Existing ap- 
proaches to configuration management such as CLI 
scripting, device-specific adapters, and entrenched com- 
mercial tools are an impediment to meeting these new re- 
quirements. Up until recently, there has been no standard 
way of configuring network devices other than SNMP,
and SNMP is not optimal for configuration management.
The IETF has released NETCONF and YANG, which are
standards focusing on configuration management. We
have validated that NETCONF and YANG greatly sim- 
plify the configuration management of devices and ser- 
vices and still provide good performance. Our perfor- 
mance tests are run in a cloud managing 2000 devices. 

Our work can help existing vendors and service 
providers to validate a standardized way to build con- 
figuration management solutions. 


1 Introduction 


The industry is rapidly moving towards a service- 
oriented approach to network management where com- 
plex services are supported by many different systems. 
Service operators are starting a transition from managing 
pieces of equipment towards a situation where an opera- 
tor is actively managing the various aspects of services. 
Configuration of the services and the affected equipment
is among the largest cost drivers in provider networks [9].
Delivering value-added services, such as MPLS VPNs,
Metro Ethernet, and IPTV, is critical to the profitability
and growth of service providers. Time-to-market
requirements are critical for new services; any delay in 
configuring the corresponding tools directly affects de- 
ployment and can have a big impact on revenue. In re- 
cent years, there has been an increasing interest in find- 
ing tools that address the complex problem of deploying 
service configurations. These tools need to replace the 
current configuration management practices that are de- 
pendent on pervasive manual work or ad hoc scripting. 
Why do we still apply these sorts of blocking techniques 
to the configuration management problem? As Enck [9] 
points out, two of the primary reasons are the variations 
of services and the constant change of devices. These 
underlying characteristics block the introduction of au- 
tomated solutions, since it will take too much time to 
update the solution to cope with daily changes. We will 
illustrate that a solution based on NETCONF [10] and
YANG [4] can overcome these underlying challenges.

Service providers need to be able to dynamically adapt 
the service configuration solutions according to changes 
in their service portfolio without defining low level de- 
vice configuration commands. At the same time, we need 
to find a way to remove the time and cost involved in the 
plumbing of device interfaces and data models by au- 
tomating device integration. We have built and evaluated 
a management solution based on the IETF NETCONF 
and YANG standards to address these configuration man- 
agement challenges. NETCONF is a configuration man- 
agement protocol with support for transactions and dedi- 
cated configuration management operations. YANG is a 
data modeling language used to model configuration and 
state data manipulated by NETCONF. NETCONF was 
pioneered by Juniper, which has a good implementation
in its devices. See the work by Tran et al. [23] for
interoperability tests of NETCONF.

Our solution is characterized by the following key 
characteristics: 


1. Unified YANG modeling for both services and de- 
vices. 


2. One database that combines device configuration 
and service configuration. 


3. Rendering of northbound and southbound interfaces 
and database schemas from the service and device 
model. Northbound are the APIs published to users 
of NCS, be it human or programmatic interfaces. 
Southbound is the integration point of managed de- 
vices, for example NETCONF. 


4. A transaction engine that handles transactions from 
the service order to the actual device configuration 
deployment. 


5. An in-memory high-performance database. 


To keep the service and device models synchronized
(items 1 and 2 above), it is crucial to understand how a
specific service instance is actually configured on each 
network device. A common problem is that when you 
tear down a service you do not know how to clean up the 
configuration data on a device. It is also a well-known 
problem that whenever you introduce a new feature or 
a new network device, a large amount of glue code is 
needed. We have addressed this again with annotated 
YANG models rather than adapter development. So for 
example, the YANG service model renders a northbound 
CLI to create services. From a device model in YANG 
we are actually able to render the required Cisco CLI 
commands and interpret the response without the need 
for the traditional Perl and Expect scripting. Currently 
our solution can integrate without any plumbing. 

It is important to address the configuration manage- 
ment problem using a transactional approach. The trans- 
action should cover the whole chain including the indi- 
vidual devices. Finally, in order to manipulate config- 
uration data for a large network and many service in- 
stances we need fast response to read and write oper- 
ations. Traditional SQL and file-based database tech- 
nologies fall short in this category. We have used an in- 
memory database journaled to disk in order to address 
performance and persistence at the same time. 

The objectives of this research are to determine 
whether these new standards can help to eliminate the de- 
vice integration problem and provide a service configu- 
ration solution utilizing automatically integrated devices. 
We have studied challenges around data-model discov- 
ery, interface versioning, synchronization of configura- 
tion data, multi-node configuration deployment, trans- 
actional models, and service modeling issues. In order 
to validate the approach we have used simulated scenar- 
ios for configuring load balancers, web servers, and web 
sites services. Throughout the use-cases we also illus- 
trate the possibilities for automated rendering of Com- 
mand Line interfaces as well as User Interfaces from 
YANG models. 

Our studies show that a NETCONF/YANG based con- 
figuration management approach removes unnecessary 
manual device integration steps and provides a platform 
for multi-device service configurations. We see that 
problems around finding correct modules, loading them 
and creating a management solution can largely be auto- 
mated. In addition to this, the transaction engine in our 
solution combined with inherent NETCONF transaction 
capabilities resolves problems around multi-device con- 
figuration deployment. 

We have run performance tests with 2000 devices in an 
Amazon cloud to validate the performance of NETCONF 
and our solution. Based on these tests we see that the 
solution scales and NETCONF provides a configuration 
management protocol with good performance. 


2 Introduction to NETCONF and YANG 


The work on NETCONF and YANG started as a result 
of an IAB workshop held in 2002. This is documented 
in RFC 3535 [18]. 


“The goal of the workshop was to continue the 
important dialog started between network op- 
erators and protocol developers, and to guide 
the IETFs focus on future work regarding net- 
work management.” 


The workshop concluded that SNMP is not being 
used for configuration management. Operators put 
forth a number of requirements that are important for 
a standards-based configuration management solution. 
Some of the requirements were: 


1. Distinction between configuration data and data that 
describes operational state and statistics. 


2. The capability for operators to configure the net- 
work as a whole rather than individual devices. 


3. It must be easy to do consistency checks of config- 
urations. 


4. The availability of text processing tools such as diff, 
and version management tools such as RCS or CVS. 


5. The ability to distinguish between the distribution 
of configurations and the activation of a certain con- 
figuration. 


NETCONF addresses the requirements above. The design
of NETCONF has been influenced by proprietary protocols
such as the Juniper Networks JUNOScript application
programming interface [14].

For a more complete introduction, see the Communications
Magazine article [19] written by Schönwälder et al.
2.1 NETCONF 


The Network Configuration Protocol, NETCONF, is an 
IETF network management protocol and is published in 
RFC 4741. NETCONF is being adopted by major network
equipment providers and has gained strong industry
support. Equipment vendors are starting to support
NETCONF on their devices; see the NETCONF presentation
by Moberg [16] for a list of publicly known implementations.

NETCONF provides mechanisms to install, manipu- 
late, and delete the configuration of network devices. Its 
operations are realized on top of a simple Remote Pro- 
cedure Call (RPC) layer. The NETCONF protocol uses 
XML based data encoding for the configuration data as 
well as the protocol messages. NETCONF is designed 
to be a replacement for CLI-based programmatic inter- 
faces, such as Perl + Expect over Secure Shell (SSH). 
NETCONF is usually transported over the SSH protocol,
using the “netconf” subsystem, and in many ways
it mimics the native proprietary CLI over SSH inter- 
face available in the device. However, it uses structured 
schema-driven data and provides detailed structured er- 
ror return information, which the CLI cannot provide. 

NETCONF has the concept of logical data-stores such
as “writable-running” or “candidate” (Figure 1). Operators
need a way to distribute changes to the devices and
validate them locally before activating them. This is
indicated by the two bottom options in Figure 1, where
configuration data can be sent to candidate databases in
the devices before they are committed to running in
production applications.

All NETCONF devices must allow the configuration 
data to be locked, edited, saved, and unlocked. In ad- 
dition, all modifications to the configuration data must 
be saved in non-volatile storage. An example from RFC 
4741 that adds an interface named “Ethernet0/0” to the 
running configuration, replacing any previous interface 
with that name is shown in Figure 2. 


2.2 YANG 


YANG is a data modeling language used to model
configuration and state data. The YANG modeling language
is a standard defined by the IETF in the NETMOD
working group. YANG can be said to be tree-structured 
rather than object-oriented. Configuration data is struc- 
tured into a tree and the data can be of complex types 
such as lists and unions. The definitions are contained 
in modules and one module can augment the tree in an- 
other module. Strong revision rules are defined for mod- 
ules. Figure 3 shows a simple YANG example. YANG 
is mapped to a NETCONF XML representation on the 
wire. 
[Figure 1 panels: WRITABLE-RUNNING; WRITABLE-RUNNING + STARTUP;
CANDIDATE; CANDIDATE + STARTUP. Operations shown: edit-config,
copy-config, commit, automatic-save.]


Figure 1: NETCONF Datastores 


<rpc message-id="101"
     xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <edit-config>
    <target>
      <running/>
    </target>
    <config
        xmlns:xc="urn:ietf:params:xml:ns:netconf:base:1.0">
      <top xmlns="http://example.com/schema/1.2/config">
        <interface xc:operation="replace">
          <name>Ethernet0/0</name>
          <mtu>1500</mtu>
          <address>
            <name>192.0.2.4</name>
            <prefix-length>24</prefix-length>
          </address>
        </interface>
      </top>
    </config>
  </edit-config>
</rpc>


<rpc-reply message-id="101"
     xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <ok/>
</rpc-reply>


Figure 2: NETCONF edit-config Operation 
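
As a minimal sketch of how a manager might assemble the request in Figure 2 programmatically, the following Python builds the same edit-config RPC with the standard library. The helper name build_edit_config is hypothetical, the address element is omitted for brevity, and no transport (SSH) is involved; the namespaces are the ones shown in Figure 2.

```python
import xml.etree.ElementTree as ET

NC = "urn:ietf:params:xml:ns:netconf:base:1.0"
CFG = "http://example.com/schema/1.2/config"  # example namespace from Figure 2


def build_edit_config(message_id, if_name, mtu):
    """Return an <edit-config> RPC that replaces one interface in <running>."""
    rpc = ET.Element(f"{{{NC}}}rpc", {"message-id": str(message_id)})
    edit = ET.SubElement(rpc, f"{{{NC}}}edit-config")
    target = ET.SubElement(edit, f"{{{NC}}}target")
    ET.SubElement(target, f"{{{NC}}}running")
    config = ET.SubElement(edit, f"{{{NC}}}config")
    top = ET.SubElement(config, f"{{{CFG}}}top")
    # operation="replace" lives in the NETCONF namespace, as in Figure 2.
    iface = ET.SubElement(
        top, f"{{{CFG}}}interface", {f"{{{NC}}}operation": "replace"}
    )
    ET.SubElement(iface, f"{{{CFG}}}name").text = if_name
    ET.SubElement(iface, f"{{{CFG}}}mtu").text = str(mtu)
    return rpc


rpc = build_edit_config(101, "Ethernet0/0", 1500)
print(ET.tostring(rpc, encoding="unicode"))
```
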
module acme-system {
  namespace
    "http://acme.example.com/system";
  prefix "acme";
  organization "ACME Inc.";
  contact "joe@acme.example.com";
  description
    "The ACME system.";
  revision 2007-11-05 {
    description "Initial revision.";
  }
  container system {
    leaf host-name {
      type string;
    }
    leaf-list domain-search {
      type string;
    }
    list interface {
      key "name";
      leaf name {
        type string;
      }
      leaf type {
        type enumeration {
          enum ethernet;
          enum atm;
        }
      }
      leaf mtu {
        type int32;
      }
      must "ifType != 'ethernet' or " +
           "(ifType = 'ethernet' and " +
           "mtu = 1500)" {
      }
    }
  }
}


Figure 3: YANG Sample 


YANG also differs from previous network manage- 
ment data model languages through its strong support 
of constraints and data validation rules. The suitability 
of YANG for data models can be further studied in the 
work by Xu et al. [24]. 
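
To make the validation rule in Figure 3 concrete, here is a small sketch of the same must condition written as a Python predicate. It is only an illustration of the logic: a real YANG implementation evaluates the XPath expression itself, and the dictionary layout and the function name mtu_constraint_ok are assumptions made for this example.

```python
def mtu_constraint_ok(interface):
    """Mirror of the Figure 3 constraint: an ethernet interface needs MTU 1500."""
    if_type = interface.get("type")
    mtu = interface.get("mtu")
    return if_type != "ethernet" or (if_type == "ethernet" and mtu == 1500)


# Illustrative interface entries (not taken from the paper).
interfaces = [
    {"name": "eth0", "type": "ethernet", "mtu": 1500},
    {"name": "eth1", "type": "ethernet", "mtu": 9000},
    {"name": "atm0", "type": "atm", "mtu": 4470},
]
invalid = [i["name"] for i in interfaces if not mtu_constraint_ok(i)]
print(invalid)
```
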


3 Our Config Manager Solution - NCS 


3.1 Overview 


We have built a layered configuration solution, NCS
(Network Configuration Server); see Figure 4. The Device
Manager manages the NETCONF devices in the network and
heavily leverages the features of NETCONF and YANG to
render a Configuration Manager from the YANG models. At
this layer, the YANG models represent the capabilities of
the devices and NCS provides the device configuration
management capabilities.


Figure 4: NCS - The Configuration Manager


The Service Manager in turn lets developers add 
YANG service models. For example, it is easy to repre- 
sent end-to-end connections over L2/L3 devices or web 
sites utilizing load balancers and web servers. The most 
important feature of the Service Manager is to transform 
a service creation request into the corresponding device 
configurations. This mapping is expressed by defining 
service logic in Java which basically does a model trans- 
formation from the service model to the device models. 


The Configuration Database (CDB) is an in-memory
database journaled to disk. CDB is a special-purpose
database that targets network management; its in-memory
design enables fast configuration validation and fast
diffs between the running and candidate databases.
Furthermore, the database schema is rendered directly
from the YANG models, which removes the need for a
mapping between the models and, for example, a SQL
database. A fundamental problem in network management is
dealing with different versions of device interfaces. NCS
detects the device interfaces through the NETCONF
capabilities of each device, and CDB uses this information
to tag the database with revision information. Whenever a
new model revision is detected, NCS can perform a schema
upgrade operation. CDB stores the configuration of
services and devices and the relationships between them.
NETCONF defines dedicated operations to read the
configuration from devices, which drastically reduces the
synchronization and reconciliation problem.


Tightly connected to CDB is the transaction manager,
which manages every configuration change as a transaction.
Transactions include all aspects from the service
model to all related device changes.


Figure 5: The Example

At this point it is important to understand that the 
NETCONF and NCS approach to configuration manage- 
ment does not use a push and pull approach to versioned 
configuration files. Rather, it is a fine-grained transac- 
tional view based on data models. 

The Rendering Engine renders the database schemas, 
a CLI, and a Web UI from the YANG models. In this way 
the Device Manager features will be available without 
any coding. 


3.2 The Example 


Throughout the rest of this paper we will use an exam- 
ple that targets configuration of web-sites across a load 
balancer and web servers. See Figure 5. 

The service model covers the aspects of a web site; IP 
Address, Port, and URL. Whenever you provision a web 
site you refer to a profile which controls the selection of 
load balancers and web servers. A web site allocates a 
listener on the load balancer which in turn creates back- 
ends that refer to physical web servers. So when provi- 
sioning a new web site you do not have to deal with the 
actual load balancer and web server configuration. You 
just refer to the profile and the service logic will
configure the devices. The involved YANG models are:


• website.yang: the service model for a web site. It
defines web site attributes such as URL, IP address, port,
and a pointer to the profile.


• lb.yang: the device model for load balancers. It
defines listeners and backends, where the listeners
refer to the web site and the backends to the
corresponding web servers.


• webserver.yang: the device model for a physical
web server. It defines listeners, document roots, etc.


The devices in our example are:


- Load Balancer: lb


- Web Servers: www1, www2, www3


3.3 The Device Manager 


The Device Manager layer is responsible for config- 
uring devices using their specific data-models and in- 
terfaces. The NETCONF standard defines a capabil- 
ity exchange mechanism. This implies that a device 
reports its supported data-models and their revisions 
when a connection is established. The capability ex- 
change mechanism also reports if the device supports a 
<writable-running> or <candidate> database. 
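
The capability exchange can be illustrated with a short sketch that sorts the capability URIs from a device hello into protocol features and YANG modules. The function name classify_capabilities and the sample hello list are assumptions for illustration; the URI prefix for protocol capabilities and the module/revision query parameters follow the NETCONF and YANG specifications.

```python
from urllib.parse import parse_qs

NC_CAP_PREFIX = "urn:ietf:params:netconf:capability:"


def classify_capabilities(uris):
    """Split capability URIs into NETCONF features and YANG module revisions."""
    features = set()
    modules = {}
    for uri in uris:
        base, _, query = uri.partition("?")
        if base.startswith(NC_CAP_PREFIX):
            # "candidate:1.0" -> feature name "candidate"
            features.add(base[len(NC_CAP_PREFIX):].split(":")[0])
            continue
        params = parse_qs(query)
        if "module" in params:
            revision = params.get("revision", [None])[0]
            modules[params["module"][0]] = revision
    return features, modules


# Hypothetical hello contents, loosely modeled on device www1 in Figure 6.
hello = [
    "urn:ietf:params:netconf:capability:candidate:1.0",
    "urn:ietf:params:netconf:capability:writable-running:1.0",
    "urn:ietf:params:netconf:capability:confirmed-commit:1.0",
    "http://acme.com/ws?module=webserver&revision=2009-12-06",
    "http://router.com/notif?module=notif",
]
features, modules = classify_capabilities(hello)
print(sorted(features), modules)
```
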

After connection the Device Manager can then use the 
get-schema RPC, as defined in the netconf-monitoring 
RFC [20] to get the actual YANG models from all the 
devices. NCS now renders northbound interfaces such as 
a common CLI and Web UI from the models. The NCS 
database schema is also rendered from the data-models. 

The NCS CLI in Figure 6 shows the discovered
capabilities for device “www1”. We see that www1
supports six YANG data models: interfaces, webserver,
notif, and three standard IETF modules. Furthermore,
the web server supports NETCONF features such as
confirmed-commit, rollback-on-error, and validation of
configuration data.

In Figure 7 we show a sequence of NCS CLI com- 
mands that first uploads the configuration from all de- 
vices and then displays the configuration from the NCS 
configuration database. So with this scenario we show 
that we could render the database schema from the 
YANG models and persist the configuration in the con- 
figuration manager. 

Now, let’s make some transaction-based configuration
changes. The CLI sequence in Figure 8 starts a transaction
that will update the NTP server on www1 and the load
balancer. Note that NCS has the concept of a candidate
database and a running database. The candidate represents
the desired configuration change, and the running database
represents the actual configuration of the devices. At the
end of the sequence in Figure 8 we use the CLI command
“compare running brief” to show the difference between
the running and the candidate database. This is what will
be committed to the devices. Note that we compute a diff
and send only the diff. Our in-memory database enables
good performance even for large configurations and large
networks.
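
The idea of sending only the difference between the running and the candidate database can be sketched as a recursive comparison of two configuration trees. This is an illustration using nested Python dictionaries, not the CDB implementation; the function name config_diff and the sample data are assumptions.

```python
def config_diff(running, candidate, path=()):
    """Yield (op, path, value) for every leaf that differs; op is '-' or '+'."""
    for key in sorted(set(running) | set(candidate)):
        old, new = running.get(key), candidate.get(key)
        here = path + (key,)
        if isinstance(old, dict) and isinstance(new, dict):
            # Descend into subtrees that exist on both sides.
            yield from config_diff(old, new, here)
        elif old != new:
            if old is not None:
                yield ("-", here, old)
            if new is not None:
                yield ("+", here, new)


running = {"lb": {"system": {"ntp-server": "18.4.5.6",
                             "resolver": {"search": "acme.com"}}}}
candidate = {"lb": {"system": {"ntp-server": "18.4.5.7",
                               "resolver": {"search": "acme.com"}}}}
for op, where, value in config_diff(running, candidate):
    print(op, "/".join(where), value)
```

Only the changed ntp-server leaf is reported; the unchanged resolver subtree produces no output and would not be sent to the device.
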
ncs> show ncs managed-device www1 capability <RET>
URI                                                  REVISION    MODULE
candidate:1.0
confirmed-commit:1.0
confirmed-commit:1.1
http://acme.com/if                                   2009-12-06  interfaces
http://acme.com/ws                                   2009-12-06  webserver
http://router.com/notif                              -           notif
rollback-on-error:1.0
urn:ietf:params:netconf:capability:notification:1.0
urn:ietf:params:xml:ns:yang:ietf-inet-types          2010-09-24  ietf-inet-types
urn:ietf:params:xml:ns:yang:ietf-yang-types          2010-09-24  ietf-yang-types
urn:ietf:params:xml:ns:yang:ietf-netconf-monitoring  2010-06-22  ietf-netconf-monitoring
validate:1.0
validate:1.1
writable-running:1.0
xpath:1.0

Figure 6: NETCONF Capability Discovery 


ncs> request ncs sync direction from-device <RET>


ncs> show configuration ncs \
       managed-device www1 config <RET>


host-settings {
  syslog {
    server 18.4.5.6 {
      enabled;
      selector 1;
    }


ncs> show configuration ncs \
       managed-device lb config <RET>


lbConfig {
  system {
    ntp-server 18.4.5.6;
    resolver {
      search acme.com;
      nameserver 18.4.5.6;
    }


Figure 7: Synchronize Configuration Data from Devices 


In the configuration scenarios shown in Figure 8 we
used the auto-rendered CLI based on the native YANG
modules that we discovered from the devices. This gives
the administrator one CLI with transactions across the
devices, but still with different commands for different
vendors in the case of non-standard modules. NCS allows
for device abstractions, where you can provide a generic
YANG module across vendor-specific ones.


ncs% set ncs managed-device \
       www1 config host-settings ntp server 18.4.5.7 <RET>


ncs% set ncs managed-device \
       lb config lbConfig system ntp-server 18.4.5.7 <RET>


ncs% compare running brief <RET>


 ncs {
   managed-device lb {
     config {
       lbConfig {
         system {
-          ntp-server 18.4.5.6;
+          ntp-server 18.4.5.7;
         }


ncs% commit


Figure 8: Configuring two Devices in one Transaction

Every commit in the scenarios described above re- 
sulted in a transaction across the involved devices. In this 
case the devices support the confirmed-commit capabil- 
ity. This means that the manager performs a commit to 
the device with a time-out. If the device does not get the 
confirming commit within the time-out period it reverts 
to the previous configuration. This is also true for restarts 
or if the SSH connection closes. 
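
The time-out behaviour of confirmed-commit can be sketched with a small simulated device. This is a toy model under stated assumptions: the class SimulatedDevice, its method names, and the explicit `now` clock argument are all hypothetical and exist only to make the roll-back rule testable; a real device implements this inside its NETCONF server.

```python
class SimulatedDevice:
    """Toy device that reverts an unconfirmed commit after a time-out."""

    def __init__(self, config):
        self.running = dict(config)
        self._previous = None   # configuration to restore on time-out
        self._deadline = None   # time by which the confirmation must arrive

    def confirmed_commit(self, new_config, now, timeout):
        self._previous = dict(self.running)
        self.running = dict(new_config)
        self._deadline = now + timeout

    def tick(self, now):
        """Advance the clock; roll back if the confirmation is overdue."""
        if self._deadline is not None and now >= self._deadline:
            self.running = self._previous
            self._previous = self._deadline = None

    def confirm(self, now):
        """Confirming commit; returns False if the device already reverted."""
        self.tick(now)
        if self._deadline is None:
            return False
        self._previous = self._deadline = None
        return True


dev = SimulatedDevice({"ntp-server": "18.4.5.6"})
dev.confirmed_commit({"ntp-server": "18.4.5.7"}, now=0, timeout=600)
dev.tick(now=600)  # no confirmation arrived in time
print(dev.running)
```
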
3.4 The Service Manager 


In our example we have defined a service model cor- 
responding to web-sites and the corresponding service 
logic that maps the service model to load balancers and 
web servers. The auto-rendered Web UI lets operators 
create a web site like the one illustrated in Figure 9. 


[Web UI form for /ncs/sm/service/type/web-site with the fields
Description, Url (www.acme.com), Ip (200.10.10.10), Port (80),
and Lb-profile (sla-gold).]


Figure 9: Instantiating a Web-site Service 


A fundamental part of the Service Manager is that we 
use YANG to model services as well as devices. In this 
way we can ensure that the service model is consistent 
with the device model. We do this at compile time by 
checking the YANG service model references to the device
model elements. At run-time, the service model constraints
can validate elements in the device model, including
referential integrity of any references. Let’s illustrate
this with a simple example. Figure 10 shows a type-safe
reference from the web-site service model to the devices.
The YANG leafref construct refers to a path in the model.
The path is verified to be correct according to the model
at compile time. At run-time, if someone tries to delete a
managed device that is referred to by a service, this
would violate referential integrity and NCS would reject
the operation.
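
The run-time side of this check can be sketched in a few lines. The data layout (a set of device names and a dictionary of services with an "lb" reference) and the names delete_device and ReferentialIntegrityError are assumptions for illustration, not the NCS API.

```python
class ReferentialIntegrityError(Exception):
    """Raised when a delete would leave a dangling service reference."""


def delete_device(devices, services, name):
    """Remove a managed device unless some service still refers to it."""
    users = sorted(s for s, svc in services.items() if svc["lb"] == name)
    if users:
        raise ReferentialIntegrityError(f"{name} is referenced by {users}")
    devices.remove(name)


devices = {"lb", "www1", "www2", "www3"}
services = {"acme-inc": {"url": "www.acme.com", "lb": "lb"}}

try:
    delete_device(devices, services, "lb")
except ReferentialIntegrityError as err:
    print("rejected:", err)
```
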

leaf lb {
  description "The load balancer to use.";
  mandatory true;
  type leafref {
    path "/ncs:ncs/ncs:managed-device/ncs:name";
  }
}


Figure 10: Service-Model Reference to Device-Model


This service provisioning request initiates a hierarchical
transaction where the service instance is a parent
transaction which fires off child transactions for every
device. In this specific case the selected profile uses
all web servers at the device layer. Either the complete
transaction succeeds or nothing will happen. As a result
the transaction manager stores the resulting device
configurations in CDB as shown in Figure 11.


Device-modifications


 ncs {
   managed-device lb {
     config {
       lbConfig {
         listener 200.10.10.10 80 {
           service acme-inc {
             number 1;
             URL-pattern www.acme.com}
           backend 192.168.0.9 8008 {
           }
           backend 192.168.0.10 8008 {
           }
           backend 192.168.0.11 8008 {
           }
           session {
             type IP;
           }
         }
       }
     }
   managed-device www1 {
     config {
       interface eth0 {
-        alias 0 {
           ipv4-address 192.168.0.9;
-        }
       }
       wsConfig {
         listener 192.168.0.9 8008 {
         }


Figure 11: The relationship from a Service to the Actual 
Device Configurations. 


You see that the web-site for acme created a listener 
on the load balancer with backends that map to the actual
web servers. The service also created a listener on
the web server. You might wonder why there is a minus- 
sign for the diff. The reason is that we are actually stor- 
ing how to delete the service. This means that there will 
never be any stale configurations in the network. As soon 
as you delete a service, NCS will automatically clean up. 
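
A minimal sketch of this bookkeeping, assuming a flat dictionary per device and hypothetical helper names apply_service and delete_service: when the service is created we record, for each touched entry, what is needed to undo it, and deletion simply replays that record.

```python
_ABSENT = object()  # marker: the entry did not exist before the service


def apply_service(device_config, additions):
    """Apply a service's changes and return the reverse diff to undo them."""
    reverse = {}
    for key, value in additions.items():
        reverse[key] = device_config.get(key, _ABSENT)
        device_config[key] = value
    return reverse


def delete_service(device_config, reverse):
    """Undo a service by replaying its stored reverse diff."""
    for key, old in reverse.items():
        if old is _ABSENT:
            device_config.pop(key, None)
        else:
            device_config[key] = old


www1 = {"interface eth0": "up"}
undo = apply_service(www1, {"listener 192.168.0.9 8008": {},
                            "alias 0": "192.168.0.9"})
delete_service(www1, undo)
print(www1)
```
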
4 Evaluation 


4.1 Performance Evaluation 


We have evaluated the performance of the solution us- 
ing 2000 devices in an Amazon Cloud. The Server is a 
4 Core CPU, 4 GB RAM, 1.0 GHz, Ubuntu 10.10 Ma- 
chine. Here we illustrate 4 test-cases. All test-cases are 
performed as one single transaction: 


1. Start the system with an empty database and upload 
the configuration over NETCONF from all devices 
(Figure 12 A). 


2. Check if the configuration database is in sync with 
all the devices (Figure 12 B). 


3. Perform a configuration change on all devices (Fig- 
ure 13 A). 


4. Create 500 service instances that touch 2 devices 
each (Figure 13 B). 


5. In Figure 14 we show the memory and database 
journal disc space for configuring 500 service in- 
stances. 


All of the test-cases involve the complete transaction 
including the NETCONF round-trip to the actual devices 
in the cloud. So, cold-starting NCS and uploading the 
configuration from 500 devices takes about 8 minutes 
(Figure 12) and 2000 devices takes about 25 minutes. 
The configuration synchronization check utilizes a trans- 
action ID to compare the last performed change from 
NCS to any local changes made to the device. This test 
assumes that there is some way to get a transaction ID 
or checksum from the device that corresponds to the last 
change irrespective of which interface is used. If that is 
not available and you had to get the complete configura- 
tion, then the numbers would be higher. 
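
The synchronization check can be sketched as a comparison of stored and reported transaction IDs. Both the function name out_of_sync and the callback fetch_transaction_id are hypothetical; how a device exposes such an ID is device-specific and is assumed here.

```python
def out_of_sync(last_known, fetch_transaction_id):
    """Return the devices whose current transaction ID differs from ours.

    last_known: device name -> ID recorded after the last change from the manager.
    fetch_transaction_id: callable returning the ID the device reports now.
    """
    return sorted(
        name for name, txid in last_known.items()
        if fetch_transaction_id(name) != txid
    )


# Simulated devices: www2 was changed locally, so its ID moved on.
recorded = {"lb": "tx-17", "www1": "tx-17", "www2": "tx-17"}
reported = {"lb": "tx-17", "www1": "tx-17", "www2": "local-3"}
print(out_of_sync(recorded, reported.get))
```
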

Updating the config on 500 devices takes roughly one
minute (Figure 13 A). As seen in Figure 14, the in-memory
database has a small footprint even for large networks. 
In this scenario it is important to note that we always diff 
the configuration change within NCS before sending it 
to the device. This means that we only send the actual 
changes that are needed and this database comparison is 
included in the numbers. This is an area where we have 
seen performance bottlenecks in previous solutions when 
traditional database technologies are used. 

These performance tests cover two aspects: perfor- 
mance of NETCONF, and our actual implementation. 

NETCONF as a protocol ensures that we achieve at 
least equal performance to CLI screen-scraping solutions
and superior performance to SNMP-based configuration
solutions. XML processing is considerably less CPU in- 
tensive than SSH processing. 
When running a transaction that touches many managed
devices, we use two tricks that affect performance.
We pipeline NETCONF RPCs, sending several RPCs in 
a row, and collecting all the replies in a row. We can also 
(in parallel) send the requests to all participating man- 
aged devices, and then (in parallel) harvest the pipelined 
replies. 
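
The two tricks can be sketched as follows. NCS itself is written in Erlang; this Python version only illustrates the pattern with a thread pool and a stand-in session object. FakeSession, pipelined, and run_transaction are hypothetical names, and no real NETCONF traffic is generated.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeSession:
    """Stand-in for a NETCONF session; replies come back in request order."""

    def __init__(self, name):
        self.name = name
        self._pending = []

    def send(self, rpc):
        self._pending.append(rpc)

    def receive(self):
        return f"{self.name}: ok {self._pending.pop(0)}"


def pipelined(session, rpcs):
    """Send all RPCs back to back, then collect all replies."""
    for rpc in rpcs:
        session.send(rpc)
    return [session.receive() for _ in rpcs]


def run_transaction(sessions, rpcs):
    """Run the pipelined exchange against every device in parallel."""
    with ThreadPoolExecutor(max_workers=len(sessions)) as pool:
        return list(pool.map(lambda s: pipelined(s, rpcs), sessions))


sessions = [FakeSession("www1"), FakeSession("lb")]
replies = run_transaction(sessions, ["lock", "edit-config", "commit"])
print(replies)
```
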


NCS is implemented in Erlang [3, 11] and OTP (Open 
Telecom Platform) [22] which have excellent support for 
concurrency and multi-core processors. A lot of effort 
has gone into parallelizing the southbound requests. For 
example initial NETCONF SSH connection establish- 
ment is done in parallel, greatly enhancing performance. 


The network configuration data is kept in a RAM 
database together with a disk journaling component. If 
the network is huge, the amount of RAM required can be 
substantial. When the YANG files are compiled we hash 
all the symbols in the data models, thus the database is 
actually a large tree of integers. This increases process- 
ing speed and decreases memory footprint of the configu- 
ration daemon. The RAM database itself is implemented 
as an Erlang driver that uses skip lists [17]. 
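
The effect of replacing symbols with integers can be sketched with a simple symbol table. Note the substitution: this example assigns sequential integers rather than computing hash values, and the names SymbolTable and encode_path are hypothetical; it shows only why paths become compact tuples of integers, not how the NCS compiler does it.

```python
class SymbolTable:
    """Map each distinct model symbol to a small integer and back."""

    def __init__(self):
        self._ids = {}
        self._names = []

    def intern(self, name):
        if name not in self._ids:
            self._ids[name] = len(self._names)
            self._names.append(name)
        return self._ids[name]

    def name(self, ident):
        return self._names[ident]


def encode_path(table, path):
    """Turn a path of symbols into a tuple of integers."""
    return tuple(table.intern(part) for part in path)


table = SymbolTable()
p1 = encode_path(table, ["lbConfig", "system", "ntp-server"])
p2 = encode_path(table, ["lbConfig", "system", "resolver"])
print(p1, p2)
```
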


Our measurements show that we can handle thousands
of devices and hundreds of thousands of services on
off-the-shelf hardware (4 Core CPU, 4 GB RAM, 1.0 GHz).


We have also made some measurements comparing 
SNMP and NETCONF performance. We read the in- 
terface table using SNMP get-bulk and NETCONF get. 
In general, NETCONF was three times faster than 
SNMP. The same kind of performance improvements us- 
ing NETCONF rather than SNMP can be found in the 
work by Yu and Ajarmeh [25]. 


4.2 NETCONF/YANG Evaluation 


Let’s look at the requirements set forth by RFC 3535 and 
validate these based on our implementation. 


4.2.1 Distinction between configuration data, and 
data that describes operational state and 
statistics 


This requirement is fulfilled by YANG and NETCONF in 
that you can explicitly request to get only the configura- 
tion data from the device, and elements in YANG are an- 
notated if they are configuration data or not. This greatly 
simplifies the procedure to read and synchronize config- 
uration data from the devices to a network management 
system. In our case, NCS can easily synchronize its con- 
figuration database with the actual devices. 
Figure 13: Making Device Configurations and Service Configurations


4.2.2 It is necessary to enable operators to concentrate
on the configuration of the network as a whole rather
than individual devices


We have validated this from two perspectives: 


1. Configuring a set of devices as one transaction. 


2. Transforming a service configuration to the corre- 
sponding device configurations. 


Using NCS, we can apply configurations to a group
of devices, and the transactional capabilities of NETCONF
will make sure that the whole transaction is applied
or no changes are made at all. The NETCONF
confirmed-commit operation has proven to be especially
useful in resolving failure scenarios. A common problem
in network configuration is that devices may become
unreachable after a reconfiguration. The confirmed-commit
operation requests the device to take the new
configuration live, but if an acknowledgement is not
received within a time-out the device automatically rolls
back. This is how NCS manages to roll back configurations
if one or several of the devices in a transaction do not
accept the configuration. It is notable that NCS needs no
complex state machines to perform roll-backs and to avoid
multiple failure scenarios.


Figure 14: Memory and Journaling Disc Space

In some cases, you would like to apply a global con- 
figuration change to all your devices in the network. In 
the general case the transaction would fail if one of the 
devices was not reachable. There is an option in NCS 
to backlog unresponsive devices. In this case NCS will 
make the transaction succeed and store outstanding re- 
quests for later execution. 
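
The difference between the default all-or-nothing behaviour and the backlog option can be sketched as below. The Device class, the commit_all function, and the representation of the backlog as a list are assumptions for illustration and do not reflect the NCS interfaces.

```python
class Device:
    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.config = {}


def commit_all(devices, change, backlog=None):
    """Apply `change` to every device or to none.

    If `backlog` is a list, unreachable devices are queued there for later
    execution instead of failing the whole transaction.
    """
    unreachable = [d for d in devices if not d.reachable]
    if unreachable and backlog is None:
        return False  # abort: no device is changed
    for d in devices:
        if d.reachable:
            d.config.update(change)
        else:
            backlog.append((d.name, dict(change)))
    return True


fleet = [Device("www1"), Device("www2", reachable=False), Device("lb")]
change = {"ntp-server": "18.4.5.7"}

print(commit_all(fleet, change))           # strict mode
queue = []
print(commit_all(fleet, change, queue))    # backlog mode
print(queue)
```
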


4.2.3 It must be easy to do consistency checks of 
configurations. 


Models in YANG contain “must” expressions that put
constraints on the configuration data. See for example
Figure 3, where the must expression makes sure that the
MTU is set to the correct size. So for example, a NETCONF
manager can edit the candidate configuration in a device 
and ask the device to validate it. In NCS we also use 
YANG to specify service models. In this way we can 
use must expressions to make sure that a service config- 
uration is consistent including the participating devices. 
Figure 15 shows a service configuration expression that 
verifies that the subnet only exists once in the VPN. 
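
The uniqueness rule expressed by the count() expression in Figure 15 can be mirrored in a few lines of Python. This is an illustration of the rule only; the list-of-dictionaries layout, the sample values, and the function name duplicate_subnets are assumptions.

```python
from collections import Counter


def duplicate_subnets(access_links):
    """Return subnets that appear on more than one access link of a VPN."""
    counts = Counter(link["subnet"] for link in access_links)
    return sorted(subnet for subnet, n in counts.items() if n > 1)


# Illustrative VPN with one conflicting subnet.
vpn_links = [
    {"site": "east", "subnet": "10.0.1.0/24"},
    {"site": "west", "subnet": "10.0.2.0/24"},
    {"site": "north", "subnet": "10.0.1.0/24"},
]
print(duplicate_subnets(vpn_links))
```

An empty result corresponds to the must expression holding for every access link.
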


4.2.4 It is highly desirable that text processing tools 
[...] can be used to process configurations. 


Since NETCONF operations use well-defined XML payloads,
it is easy to process configurations, for example by
doing a diff between the configuration in the device and
the desired configuration in the management system.
The CLI output in Figure 16 shows a diff between a device
configuration and the NCS Configuration Database.
In this case a system administrator has used local tools
on web server 1 to change the document root and remove
the listener.


must "count(
      ../../mv:access-link[subnet =
      current()/../subnet]) = 1" {
  error-message "Subnet must be unique within the VPN";
}


Figure 15: Service Configuration Consistency


4.2.5 It is important to distinguish between the dis- 
tribution of configurations and the activation 
of a particular configuration. 


The concept of multiple data-stores in NETCONF lets 
managers push the configuration to a candidate database, 
validate it, and then activate the configuration by com- 
mitting it to the running datastore. Figure 17 shows an 
extract from the NCS trace when activating a new con- 
figuration in web server 2. 

ncs> request ncs managed-device \
     www1 compare-config outformat cli <RET>

diff
 ncs {
   managed-device www1 {
     config {
       wsConfig {
         global {
-          ServerRoot /etc/doc;
+          ServerRoot /etc/docroot;
         }
-        listener 192.168.0.9 8008 {
-        }
       }
     }
   }
 }

Figure 16: Comparing Configurations
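
The kind of comparison shown in Figure 16 can be approximated with ordinary text tools. The sketch below uses Python's standard difflib module on two configuration texts; the compare_config function and the sample configurations are hypothetical, and this is not how NCS computes its diff.

```python
import difflib


def compare_config(cdb_cfg, device_cfg):
    """Return unified-diff lines from the stored config to the device config."""
    return list(
        difflib.unified_diff(
            cdb_cfg.splitlines(),
            device_cfg.splitlines(),
            fromfile="cdb",
            tofile="device",
            lineterm="",
        )
    )


# Configuration as stored in the management system (assumed sample data).
cdb = """global {
ServerRoot /etc/doc;
}
listener 192.168.0.9 8008 {
}"""

# Configuration found on the device after a local change (assumed sample data).
device = """global {
ServerRoot /etc/docroot;
}"""

for line in compare_config(cdb, device):
    print(line)
```

Lines prefixed with "-" exist only in the stored configuration and lines prefixed with "+" exist only on the device, which mirrors the reading of Figure 16.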


ncs% set ncs managed-device \
     www2 config wsConfig global ServerRoot /etc/doc <RET>
ncs% commit | details <RET>
ncs: SSH Connecting to admin@www2
ncs: Device: www2 Sending edit-config
ncs: Device: www2 Send commit

Commit complete.

Figure 17: Separation of Distribution of Configurations
and Activation


5 Related Work 


5.1 Mapping to Taxonomy of Configura- 
tion Management Tools 


We can map our solution to other Configuration Management
solutions based on the taxonomy defined by Delaet and
Joosen [7]. They define a taxonomy based on four criteria:
abstraction level, specification language, consistency,
and distributed management.

The abstraction level ranges from high-level end-to-end
requirements to low-level bit-requirements. As shown in
Figure 18 and described below, our solution works with
levels 1-5 of the six abstraction levels.


1. End-to-end Requirements - The service models in
the Service Manager express end-to-end requirements,
including constraints expressed as XPATH must
expressions. In the case of our web site provisioning
example this corresponds to the model for a web site,
website.yang.


[Figure 18 (diagram): the Service Manager (website.yang
and the Service Logic, web-site.java) appears beside
"End-to-end requirements" and "Instance distribution
rules"; the Device Manager (hostsettings.yang,
loadbalancer.yang, webserver.yang) beside "Instance
configurations" and "Implementation dependent instances";
the NETCONF XML payload sent to the Web Server and Load
Balancer beside "5. Configuration files".]

Figure 18: NCS in the Configuration Taxonomy Defined
by Delaet and Joosen


2. Instance Distribution Rules - How an end-to-end
service is allocated to resources is expressed in the
Java Service Logic Layer. In this layer we map the
provisioning of a web site to the corresponding load
balancer and web-server models.


3. Instance Configurations - The changed configuration
of devices in the Device Manager. The result of the
previous point is a diff, that is, a configuration change,
sent to the NCS Device Manager. The Device Manager has
two layers: a device-independent layer that can abstract
different data-models for the same feature, and a concrete
device model layer. The device-independent layer may be
vendor-independent. In Figure 18 we indicate a
vendor-independent hostsettings.yang model which contains
a unified model for host settings like DNS and NTP.


4. Implementation Dependent Instances - The con- 
crete device configuration in the NCS Device Man- 
ager. This is the actual configuration that is sent to 
the devices in order to achieve the service instanti- 
ation. In the specific example of a web site it is the 
configuration change to the load balancers and web 
servers. 


5. Configuration Files - The NETCONF XML edit-config
payload sent to the devices. Note, however, that whereas
most tools work with configuration files, NETCONF makes
fine-grained configuration changes.


6. Bit-Configurations - Disk images are not directly 
managed by NETCONF as such. 
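
To make level 5 concrete, the following sketch builds the kind of XML payload a NETCONF edit-config request carries, using only Python's standard library. The NETCONF base namespace is the standard one; the wsConfig namespace URI and the build_edit_config helper are assumptions made for illustration and do not come from the NCS models.

```python
import xml.etree.ElementTree as ET

NC_NS = "urn:ietf:params:xml:ns:netconf:base:1.0"
WS_NS = "http://example.com/ns/wsconfig"  # hypothetical module namespace


def build_edit_config(server_root, message_id="1"):
    """Build an <edit-config> RPC that sets ServerRoot in the candidate."""
    rpc = ET.Element(f"{{{NC_NS}}}rpc", {"message-id": message_id})
    edit = ET.SubElement(rpc, f"{{{NC_NS}}}edit-config")
    target = ET.SubElement(edit, f"{{{NC_NS}}}target")
    ET.SubElement(target, f"{{{NC_NS}}}candidate")
    config = ET.SubElement(edit, f"{{{NC_NS}}}config")
    # Only the changed leaf is sent, not a whole configuration file.
    ws = ET.SubElement(config, f"{{{WS_NS}}}wsConfig")
    glob = ET.SubElement(ws, f"{{{WS_NS}}}global")
    ET.SubElement(glob, f"{{{WS_NS}}}ServerRoot").text = server_root
    return ET.tostring(rpc, encoding="unicode")


print(build_edit_config("/etc/doc"))
```

The payload contains only the subtree that changes, which is the fine-grained behaviour that distinguishes NETCONF from file-based tools.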


When it comes to the specification language we have a
uniform approach based on YANG at all levels. Delaet and
Joosen characterize the specification language from four
perspectives: language-based or user-interface-based,
domain coverage, grouping mechanism and multi-level
specification. We will elaborate on these perspectives
below.

We certainly focus on a language-based approach 
which can render various interface representations. 
Users can edit the configuration using the auto-rendered 
CLI and Web UI. You can also feed NCS with the NET- 
CONF XML encoding of the YANG models. NCS is a 
general purpose solution in that the domain is defined 
by the YANG models and not the system itself. YANG 
supports groupings at the modeling level and NCS sup- 
ports groupings of instance configurations as config- 
urable templates. Templates can be applied to groups of 
devices. 

NCS supports multi-level specifications, which in Delaet
and Joosen's taxonomy refers to the ability to transform
the configuration specifications to other formats.
In our case, we are actually able to render Cisco CLI 
commands automatically from the configuration change. 
This is a topic of its own and will not be fully covered 
here. However NCS supports YANG model-driven CLI 
engines that can be fed with a YANG data-model and 
the engine is capable of rendering the corresponding CLI 
commands. 

Consistency has three perspectives in the taxonomy: 
dependency modeling, conflict management, and work- 
flow management. We do not cover workflow manage- 
ment. We consider workflow systems to be a client to 
NCS. NCS manages dependencies and conflicts based 
on constraints in the models and run-time policies. The
model constraints specify dependencies and rules that are
imposed by the model itself, while policies are run-time
constraints defined by system administrators. We use
XPATH [6] expressions in both contexts.

Regarding conflict management, NCS will detect conflicts
as violations of policies, as described above. The result
is an error message when the user tries to commit the
conflicting configuration.

The final component of the taxonomy covers the aspect of
distribution. NCS supports a fine-grained AAA system that
lets different users and client systems perform different
tasks. The agent is a NETCONF server on the managed
devices. The NCS server itself is centralized. The primary
reason is to enable quick validation of cross-device
policies. The performance is guaranteed by the in-memory
database.


5.2 Comparison to other major configura- 
tion management tools 


There are many well-designed configuration management
tools like CFengine [5], Puppet [15], LCFG [2] and
Bcfg2 [8]. These tools are more focused on system
and host configuration whereas we focus mostly on net- 
work devices and network services. This is mostly de- 
termined by the overall approach taken for configuration 
management. In our model the management system has a 
data-model that represents the device and service config- 
uration. Administrators and client programs express an 
imperative desired change based on the data-model. NCS 
manages the overall transaction by the concept of a can- 
didate and running database which is a well-established 
principle for network devices. 

Many host-management tools use the concept of centralized,
versioned configuration files rather than a database with
roll-back files. Also, in a host environment you can put
your own agents on the hosts, which is not the case for
network devices. Therefore a protocol-based approach
like NETCONF/YANG is needed.

Another difference is the concept of desired state. For 
host configuration it is important to make sure that the 
hosts follow a centrally defined configuration which is 
fairly long-lived. In our case we are more focused on
doing fine-grained real-time changes based on requirements
for new services. There is room for a combination of the
two approaches, where host-based approaches focused on
configuration files address the more static setup of the
device and our approach, on top of that, addresses
dynamic changes.

It is also worth noting that most of the existing tools
have made up their own specific languages to describe
configuration. YANG is a viable option for the
above-mentioned tools to change to a standardized
language.

There is of course a whole range of commercial tools,
like Telcordia Activator [21], HP Service Activator [12],
and Amdocs [1], that address network and service
configuration. While they are successfully being used for
service configuration, the cost and release cycles of
device adapters and the flexibility of service models can
be a challenge.


6 Conclusion and Future Work 


6.1 Conclusion 


We have shown that a standards-based approach for net- 
work configuration based on NETCONF and YANG can 
ease the configuration management scenarios for opera- 
tors. Also the richness of YANG as a configuration de- 
scription language lends itself to automating not only the 
device communication but also the rendering of interfaces
like Command Line Interfaces and Web User Interfaces. Much
of the value in this IETF standard lies in the
transaction-based approach to configuration management and
a rich domain-specific language to describe the
configuration and operational data. We used Erlang and
in-memory database technology for our reference imple- 
mentation. These two choices provide performance for 
parallel configuration requests and fast validation of con- 
figuration constraints. 


6.2 Future Work 


We have started to work on a NETCONF-to-SNMP adaptation
solution, which is critical for migrating from current
implementations. This will allow for two scenarios:
read-only and read-write. The read-only view is a direct
mapping of SNMP MIBs to a corresponding NETCONF/YANG
view; this mapping is being standardized by the IETF [13].
The read-write view is more complex and cannot be fully
automated. The main reason is that the transactional
capabilities and dependencies between MIB variables are
not formally defined in the SNMP SMI. For example, it is
common that you need to set one variable before changing
others. We are working on capturing the most common
scenarios and defining YANG extensions for those in order
to automatically render as much as possible.

Furthermore we are working on a solution where we 
can have hierarchical NCS systems in order to cover huge 
networks like nation-wide Radio Access Networks. We 
will base this on partitioning of the instantiated model 
into separate CDBs. NCS will then proxy any NET- 
CONF requests to the corresponding NCS system. 

We are also working on two interesting features in order
to understand the service configuration versus the device
configuration: "dry-run" and "service check-sync".
Committing a service activation request with dry-run
calculates the resulting configuration changes to the
devices and displays the diff without committing it. This
is helpful in a what-if scenario: "If I provision this
VPN, what happens to my devices?" The service check-sync
feature will compare a service instance with the actual
configuration that is on devices and display any
conflicting configurations. This is useful to detect and
analyze if and how the device configurations have been
changed by any local tools in a way that breaks the
service configurations.

Abstract


This paper describes how we deployed IPv6 in our corporate network in a relatively short time with a small core
team that carried most of the work, the challenges we faced during the different implementation phases, and the
network design used for IPv6 connectivity.


The scope of this document is the Google enterprise network. That is, the internal corporate network that involves
desktops, offices and so on. It is not the network or machines used to provide search and other Google public
services.


Our enterprise network consists of heterogeneous vendors, equipment, devices, and hundreds of in-house developed 
applications and setups; not only different OSes like Linux, Mac OS X, and Microsoft Windows, but also different 
networking vendors and device models including Cisco, Juniper, Aruba, and Silverpeak. These devices are deployed 
globally in hundreds of offices, corporate data centers and other locations around the world. They support tens of 
thousands of employees, using a variety of network topologies and access mechanisms to provide connectivity. 


Tags: IPv6, deployment, enterprise, early adoption, case study. 


1. Introduction 


The need to move to IPv6 is well-documented and well-known,
the most obvious motivation being IANA IPv4 exhaustion in
Feb 2011. Compared to alternatives like Carrier-Grade NAT,
IPv6 is the only strategy that makes sense for the long
term, since only IPv6 can assure the continuous growth of
the Internet, improved openness, and the simplicity and
innovation that come with end-to-end connectivity.


There were also a number of internal factors that helped 
motivate the design and implementation process. The 
most important was to break the chicken-or-egg prob- 
lem, both internally and as an industry. Historically, 
different sectors of the Internet have pointed the finger 
at other sectors for the lack of IPv6 demand, either for 
not delivering IPv6 access to users to motivate content 
or not delivering IPv6 content to motivate the migration 
of user networks. To help end this public stalemate, we
knew we had to give Google engineers IPv6 access so that
they could launch IPv6-ready products and services.


Google has always had a strong culture of innovation and
we strongly believed that IPv6 will allow us to build for
the future. And when it comes to universal access to
information we want to provide it to all users,
regardless of whether they connect using IPv4 or IPv6.


We needed to innovate and act promptly. We knew that 
the sooner we started working with networking equip- 
ment vendors and with our transit service providers to 
improve the new protocol support, the earlier we could 
adopt the new technology and shake the bugs out. 
Another interesting problem we were trying to solve in
our enterprise organization was that we were running
short of private RFC 1918 addresses. We wanted to
evaluate techniques like Dual-Stack Lite, i.e., to make
hosts IPv6-only and run DS-Lite on the hosts to provide
IPv4 connectivity to the rest of the world if needed.


2. Methodology 


Our project started as a grass-roots activity undertaken 
by enthusiastic volunteers who followed the Google 
practice of contributing 20% of their time to internal 
projects that fascinate them. The first volunteers had to 
learn about the new protocol from books and then plan 
labs to start building practical experience. Our essential 
first step was to enable IPv6 on our corporate network, 
so that internal services and applications could follow. 

Our methodology was driven by four principles:


1. Think globally and try to enable IPv6 every- 
where: in every office, on every host and every 
service and application we run or use inside 
our corporate network. 

2. Work iteratively: plan, implement, and iterate 
launching small pieces rather than try to com- 
plete everything at once. 

3. Implement reliably: Every IPv6 implementa- 
tion had to be as reliable and capable as the 
IPv4 ones, or else no one would use and rely 
on the new protocol connectivity. 

4. Don't add downtime: Fold the IPv6 deploy- 
ments into our normal upgrade cycles, to avoid 
additional network outages. 


3. Planning and early deployment phases 


First, we started creating a comprehensive addressing 
plan for the different sized offices, campus buildings, 
and data centers. Our initial IPv6 addressing scheme 
followed the guidelines specified in RFC5375 (IPv6 
Unicast address assignment): 


- Assign /64 for each VLAN
- Assign /56 for each building
- Assign /48 for each campus or office
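
The hierarchy above can be explored with Python's standard ipaddress module. This is only a sketch: the carve function is hypothetical, and 2001:db8::/32 is the prefix reserved for documentation, not an address block from our network.

```python
import ipaddress


def carve(campus_prefix, building_index, vlan_index):
    """Return (campus /48, building /56, VLAN /64) for the given indexes."""
    campus = ipaddress.IPv6Network(campus_prefix)
    if campus.prefixlen != 48:
        raise ValueError("expected a /48 per campus or office")
    # A /48 holds 256 /56 building blocks; each /56 holds 256 /64 VLANs.
    building = list(campus.subnets(new_prefix=56))[building_index]
    vlan = list(building.subnets(new_prefix=64))[vlan_index]
    return campus, building, vlan


campus, building, vlan = carve("2001:db8:1234::/48", 1, 2)
print(building)  # 2001:db8:1234:100::/56
print(vlan)      # 2001:db8:1234:102::/64
```

With these sizes each campus can hold up to 256 buildings and each building up to 256 VLANs, and every boundary falls on a whole hexadecimal digit pair.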


We decided to use the Stateless Address Auto- 
Configuration capability (SLAAC) for IPv6 address 
assignments to end hosts. This stateless mechanism al- 
lows a host to generate its own addresses using a com- 
bination of locally available information and infor- 
mation advertised by routers, thus no manual address 
assignment is required. 
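
One well-known way for a host to form its address under SLAAC is the modified EUI-64 scheme, in which the interface identifier is derived from the MAC address and appended to the /64 prefix advertised by the router. The sketch below illustrates that derivation only; hosts that use the privacy extension choose randomized identifiers instead. The slaac_eui64 function, the documentation prefix and the MAC address are assumptions for illustration.

```python
import ipaddress


def slaac_eui64(prefix, mac):
    """Combine an advertised /64 prefix with a modified EUI-64 identifier."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("SLAAC expects a /64 prefix")
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    octets[0] ^= 0x02  # flip the universal/local bit
    # Insert ff:fe between the two halves of the MAC address.
    iid = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return ipaddress.IPv6Address(
        int(net.network_address) + int.from_bytes(iid, "big")
    )


print(slaac_eui64("2001:db8:0:1::/64", "00:11:22:33:44:55"))
```

The prefix comes from the router advertisement and the identifier from the host itself, which is why no per-host address configuration is needed.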


As manually configuring IP addresses has never really been
an option, this approach worked around the limitations of
DHCPv6 client support in various operating systems and
therefore sped the rollout of IPv6. It also provides a
seamless method to renumber, and provides address privacy
via the privacy extension feature (RFC 4941).

Meanwhile, we also requested various sized IPv6 space
assignments from the Regional Internet Registries. Having
PI (Provider Independent) IPv6 space was required to solve
any potential multihoming issues with our multiple service
providers.


Next, we had to design the IPv6 network connectivity
itself. We obviously had several choices here; we
preferred dual-stack if possible, but if not then we had
to build different types of tunnels (as a 6-to-4
transition mechanism) on top of the existing IPv4
infrastructure or create a separate IPv6 infrastructure.
The latter was not our preferred choice since it would
have required additional time and resources to order data
circuits and to build a separate infrastructure for IPv6
connectivity.


We also tried to design a scalable IPv6 backbone to 
accommodate all existing WAN clouds (MPLS, Internet 
Transit and the Google Production network, which we 
use as our Service provider for some of the locations). 
Along with the decision to build the IPv6 network on 
top of the existing physical one we tried to keep the 
IPv6 network design as close to the IPv4 network in 
terms of routing and traffic flows as possible. The prin- 
ciple of changing only the minimum amount necessary 
was applied here. 


By keeping the IPv6 design simple, we wanted to en- 
sure scalability and manageability; also it is much easier 
for the network operations team to support it. In order 
to comply with this policy we decided to use the follow- 
ing routing protocols and policies: 


- HSRPv2 - First hop redundancy
- OSPFv3 - Interior gateway protocol
- MP-BGP - Exterior gateway protocol
- SLAAC - IP address assignment for the end hosts


Our proposed routing policy consists of the following
rules: we advertise the office aggregate routes to the
providers, and accept only the default route from the
transit provider.


We also aggressively started testing and certifying code 
for the various hardware vendors’ platforms and work- 
ing on building or deploying IPv6 support into our in- 
house built network management tools. 


In 2008 we got our first ARIN-assigned /40 IPv6 space 
for GOOGLE IT and we deployed a single test router 
having a dual-stacked link with our upstream transit 
provider. The reason for having a separate device was 
to be able to experiment with non-standard IOS ver- 
sions and also to avoid the danger of having higher re- 
source usage (like CPU power). 

Figure 1: phase I - dual-stack separate hosts and labs

The early enthusiasts and volunteers testing the IPv6
protocol each had one GRE tunnel running from their
workstation to this single IPv6-capable router. This
sometimes resulted in around 200ms latency, because
relatively closely located IPv6 sites were reached via a
broker device on the other side of the world.

Figure 2: phase II - dual-stack offices


The next steps during this initial implementation phase 
were to create several fully dual-stacked labs (Figure 1) 
and connect them to the dual-stacked router using the 
same GRE tunnels, but instead of at certain hosts, these 
GRE tunnels were now terminated at the lab routers. 

In the next phase we started dual-stacking entire offices 
and campus buildings (Figure 2) and then building a 
GRE tunnel from the WAN Border router at each loca- 
tion to the egress IPv6 peering router. 


In the third phase we started dual-stacking entire offic- 
es, while trying to prioritize deployment in offices with 
immediate need for IPv6 (Figure 3), e.g. engineers 
working on developing or supporting applications for 
IPv6. 


Using this phased approach allowed us to gradually gain 
skills and confidence and also to confirm that IPv6 is 
stable and manageable enough to be deployed in our 
network globally. 


4. Challenges 


We faced numerous challenges during the planning and 
deployment phases, not only technical, but also admin- 
istrative and organizational such as resource assign- 
ment, project prioritization and the most important - 
education, training and gaining experience. 


4.1 Networking challenges 

The most important technical issue we faced was the fact
that the major networking vendors lack enterprise IPv6
features, especially on some of the mid-range devices and
platforms. Also, certain hardware platforms support IPv6
in software only, which causes high CPU usage when the
packets are handled by the software. This has a severe
performance impact when using access control lists (ACLs).
In another example of limitations with some of our routing
platform vendors, the only IPv6 tunneling mechanism
available is Generic Routing Encapsulation (GRE). The main
reason for this partial IPv6 implementation in the
networking devices is that most vendors are not even
running IPv6 in their own corporate networks. Also, the
TCAM table in one of the switch platforms we use is
limited when you enable an IPv6 SDM routing template.
Another example of a network challenge is the
software-only routing support of IPv6 in the platforms we
deploy as wireless core switches.

Figure 3: phase III - dual-stack the upstream WAN
connections to the transit and MPLS VPN providers


Our wireless equipment vendor did not have support for
IPv6 ACLs and currently lacks support for IPv6 routing.


We also faced a problem with VLAN pooling on the wireless
controllers. In that mechanism, the wireless controller
assigns IP addresses from the different VLANs (subnets) on
a round-robin basis as each wireless client logs in. We
wanted to utilize multiple VLANs using this technique to
provide easy address management and scalability. However,
the VLAN pooling implementation of our specific vendor
leaked IPv6 neighbor discovery and multicast Router
Advertisements (RAs) between the VLANs. This introduced
IPv6 connectivity issues, as the clients were able to see
multiple RAs from outside the client VLANs. The solution
provided by the vendor in a later software release was to
implement IPv6 firewalling to restrict the neighbor
discovery and Router Advertisement multicast traffic
leaking across VLANs.


One more example is the WAN Acceleration devices we use in
our corporate network. We cannot encrypt or accelerate
IPv6 traffic using WCCP (Web Cache Communication
Protocol), since the current protocol standard (WCCPv2)
does not support IPv6 and thus it is not implemented on
the devices. Currently we are evaluating workarounds like
PBR (Policy Based Routing) to

with the dual-stack infrastructure is getting a feel for
how much traffic on the links is IPv4 and how much IPv6.
We still needed to work on collecting, parsing, and
properly displaying NetFlow statistics for IPv6 traffic.
The problem that we have here is due to a specific
routing platform vendor that is no longer developing the
OS branch for the specific hardware model we use, while
the current OS versions do not support NetFlow v9.


We also faced some big challenges when working with
various service providers. The SLA that they support for
IPv6 is very different from the SLA for IPv4 and, in our
experience, turning up IPv6 peering sessions takes much
longer than turning up IPv4 ones. In addition, our
internal network monitoring tools were unable to alert on
base monitoring for IPv6 connectivity until recently.


4.2 Application and client software 

The main problem was that the many application whitelists
we use for multiple internal applications were initially
not developed to support IPv6, so when we first started
implementing IPv6 the users on the IPv6-enabled VLANs and
offices were not able to reach many of our internal online
tools. We even got some false positive security reports
saying that some unknown addresses were trying to access
restricted online applications.
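
A whitelist check that handles both address families can be sketched with Python's standard ipaddress module. The is_allowed function and the sample networks (RFC 1918 space and the IPv6 documentation prefix) are hypothetical and do not describe any of our internal tools.

```python
import ipaddress


def is_allowed(client, whitelist):
    """Return True if the client address falls inside any whitelisted network."""
    addr = ipaddress.ip_address(client)
    for entry in whitelist:
        net = ipaddress.ip_network(entry)
        # Compare only within the same address family.
        if addr.version == net.version and addr in net:
            return True
    return False


ipv4_only = ["10.0.0.0/8"]
dual_stack = ["10.0.0.0/8", "2001:db8::/32"]

print(is_allowed("2001:db8::1", ipv4_only))   # False
print(is_allowed("2001:db8::1", dual_stack))  # True
```

The first call shows the failure mode described above: a user on an IPv6-enabled VLAN is rejected simply because the whitelist contains no IPv6 entries.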


In order to fight this problem, we aimed at phasing out 
old end-host OSes and applications that do not support 
IPv6 or where IPv6 is disabled by default. Although we 
no longer support obsolete host OSes in our corporate 
network, there are still some IPv6 related issues with 

connectivity might be broken due to problems with the
remote ISATAP router and infrastructure. 


We also still have not fully solved the printer problem,
as most printers do not support IPv6 at all or support it
just for management.


Unfortunately large groups of systems and applications 
exist that cannot be easily modified, even to enable 
IPv6 - for example heavy databases and some of the 
billing applications due to the critical service they offer. 
And on top of that, the systems administrators are often 
too busy with other priorities and do not have the cycles 
to work on IPv6 related problems. 


5. Lessons Learned 


We learned a lot of valuable lessons during the deploy- 
ment process. Unfortunately, the majority of the prob- 
lems we’ve faced were unexpected. 


Since lots of providers still do not offer dual-stack sup- 
port to the CPE (customer-premises equipment), we had 
to use manually built GRE over IPSec tunnels to pro- 
vide IPv6 connectivity for our distributed offices and 
locations. 


Creating tunnels reduces the maximum transmission unit
(MTU) available to the packets. This often causes extra
load on the router's CPU and memory, and any fragmentation
and reassembly adds extra latency. Since we often do not
have full control over the network connectivity from end
to end (e.g. between the different office locations) we
had to lower the IPv6 path MTU to 1416 to avoid packets
being lost due to lost ICMPv6 messages on the way to the
destination.
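
The value of 1416 can be put into context with some simple arithmetic. The sketch below assumes an underlying link MTU of 1500 bytes, which the text does not state; the header sizes and the IPv6 minimum MTU are standard protocol constants, and the function names are hypothetical.

```python
LINK_MTU = 1500      # assumed underlying Ethernet MTU (not stated in the text)
TUNNEL_MTU = 1416    # IPv6 path MTU chosen in the text
IPV6_HEADER = 40     # fixed IPv6 header size in bytes
TCP_HEADER = 20      # TCP header size without options
IPV6_MIN_MTU = 1280  # minimum MTU every IPv6 link must support


def tunnel_overhead(link_mtu, tunnel_mtu):
    """Bytes per packet left for the tunnel encapsulation."""
    return link_mtu - tunnel_mtu


def tcp_mss(mtu):
    """Largest TCP payload that fits in one unfragmented IPv6 packet."""
    return mtu - IPV6_HEADER - TCP_HEADER


print(tunnel_overhead(LINK_MTU, TUNNEL_MTU))  # 84
print(tcp_mss(TUNNEL_MTU))                    # 1356
print(TUNNEL_MTU >= IPV6_MIN_MTU)             # True
```

Under that assumption the setting leaves 84 bytes per packet for the GRE and IPsec encapsulation while staying above the IPv6 minimum MTU.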


Another big problem we had to deal with was the immature
IPv6 support in end-host OSes. For example, some of them
still prefer IPv4 over IPv6 connectivity by default. Some
others do not even have IPv6 connectivity turned on by
default, which leaves the users of such an OS unable to
test and provide feedback on the IPv6 deployment. It also
turned out that another popular host OS does not have
client support for DHCPv6 and thus we were forced to go
with SLAAC for assigning IPv6 addresses to the end hosts.


We ran into countless application problems too: no WCCP
support for IPv6, no proxy, no VoIP call managers, and
many more. When we tried to talk to the vendors they
always said: "If there is a demand for IPv6 support at
all, we've never heard it before."


In summary, when it comes to technical problems we 
can confirm that there is a lot of new, unproven and 
therefore buggy code, and getting our vendors aligned 
so that everything supports IPv6 has been a challenge. 


Regarding the organizational lessons we learned, the most
important one is that IPv6 migration potentially touches
everything, and so migrating just the network or just a
single service or application or platform does not make
sense by itself. This project also turned out to be a much
longer-term project than originally intended. We've been
working on this project for 4 years already and we are
still probably only halfway to completion. Still, the
biggest challenge is not deploying IPv6 itself, but
integrating the new protocol in all management procedures
and applying all current IPv4 practices to it too, for
example the demand for redundancy, reliability and
security.


6. Summary 


The migration to IPv6 is not an L3 problem. It is more 
of an L7-9 problem: resources, vendor relation- 
ship/management, and organizational buy-in. The net- 
working vendors’ implementations mostly work, but 
they do have bugs: we should not expect something to 
work just because it is declared supported. 


Because of that we had to test every single IPv6 related 
feature, then if a bug was found in the lab we reported it 
and kept on testing! 


7. Current status and future work 


Around 95% of the engineers accessing our corporate 
network have IPv6 access on their desks and are 
whitelisted for accessing Google public services 
(search, Gmail, YouTube, etc.) over IPv6. This way they
can work on creating, testing and improving IPv6 aware 
applications and Google products. At the same time 
internally we keep on working on enabling IPv6 support 
on all our internal tools and applications used in the 
corporate network. 

Figure 4: Timeline for dual-stacking Google corporate
locations


In the long run, the potential of introducing DHCPv6
(stateful auto-configuration) can be investigated, given
the advantages of DHCP flexibility and better management.
However, enabling this functionality still depends on
DHCPv6 client support on the end hosts' desktop platforms.


We also want to revisit the allocation of a /64 to every
subnet on the corporate network, since the recently
published RFC 6164 recommends assigning /127 prefixes on
point-to-point links.
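
What a /127 looks like in practice can be shown with Python's standard ipaddress module. This is a small sketch; the p2p_link function and the documentation prefix are assumptions for illustration.

```python
import ipaddress


def p2p_link(prefix):
    """Return the two addresses contained in a /127 point-to-point prefix."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 127:
        raise ValueError("expected a /127 prefix")
    return net[0], net[1]


left, right = p2p_link("2001:db8:0:ffff::/127")
print(left)   # 2001:db8:0:ffff::
print(right)  # 2001:db8:0:ffff::1
```

A /127 contains exactly two addresses, one for each end of the link, whereas a /64 on the same link would leave almost the entire subnet unused.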


Since the highest priority for all organizations is to 
IPv6-enable their public-facing services, following our 
experience we can confirm - dual-stack works well to- 
day as a transition mechanism! 


There is still quite a lot of work before IPv4 can be 
turned off anywhere, but we are working hard towards 
it. The ultimate goal is to successfully support employ- 
ees working on an IPv6-only network. 

Abstract


The Berlin Open Wireless Lab (BOWL) project at
Technische Universität Berlin (TUB) maintains an outdoor
WiFi network which is used both for Internet access
and as a testbed for wireless research. From the
very beginning of the BOWL project, we experienced 
several development and operations challenges to keep 
Internet users and researchers happy. Development chal- 
lenges included allowing multiple researchers with very 
different requirements to run experiments in the network 
while maintaining reliable Internet access. On the oper- 
ations side, one of the recent issues we faced was au- 
thentication of users from different domains, which re- 
quired us to integrate with various external authentica- 
tion services. In this paper, we present our experience 
in handling these challenges on both development and 
operations sides and the lessons we learned. 

Keywords: WiFi, configuration management, authenti- 
cation, research, DevOps, infrastructure, testbed 


1 Introduction 


Wireless testbeds are invaluable for researchers to test 
their solutions under real system and network condi- 
tions. However, these testbeds typically remain ex- 
perimental and are not designed for providing Internet 
access to users. In the BOWL project [2, 7, 9], we 
stepped away from the typical and designed, deployed 
and currently maintain a live outdoor wireless network 
that serves both purposes. The benefits are twofold [9]: 


• University staff and students have outdoor wireless
network access. Our network covers almost the entire
Technische Universität Berlin (TUB) campus in
central Berlin (see Fig. 1).

• Researchers have a fully reconfigurable research
platform for wireless networking experimentation
that includes real network traffic (as opposed to
synthetic traffic).


During its lifetime, the BOWL network has evolved
significantly from a prototype architecture and design
in 2009 [7, 9] towards a production network, which
brings several administrative and development
challenges. The network and its components, including
traffic generators, routers and switches, interconnect with
a variety of other networks and infrastructures which are
not controlled by the BOWL project, adding to the
inherent complexity of running a production network. In
this paper, we focus on two of the many challenges that
we have experienced in the last year while moving from
a prototype to a more stable infrastructure: one from
the perspective of development, and one from the
perspective of network operations and its reliance on
external services.

Figure 1: Coverage of the BOWL network on the TU-Berlin
campus.

Development challenges were, and still are, numerous
[9]. The most prominent is the variety of people
who work on different subsets of network components
and change network configurations and operating system
images. The requirements for associated services and
infrastructure, as well as the research goals, change
continuously as we and other users change the way the
BOWL network is used on a daily basis. In fact, our
experience showed that it was necessary to rewrite the
BOWL software significantly during both the development
and the operational lifetime of the BOWL project. Many
of the changes were also triggered by feedback received
from external users.

From a purely operational point of view, authentication
of users to the BOWL network has proven surprisingly
complex. A project-specific remote authentication
dial-in user service (RADIUS) installation is used as the
pivot point to integrate a number of other distributed
and disparate authentication solutions. Users include (1)
centrally managed university IT accounts, (2) users from
our own department, (3) users of the affiliated external
institution Deutsche Telekom Laboratories (hereafter
T-Labs), (4) project-only user accounts, and (5) eduroam
users. TUB user authentication is a critical part of the
contractual relationship with the university central IT
department. The major challenge we faced, and still face,
is recovering from errors in the external
authentication services that we rely on to support these
accounts. In this paper, we present a major outage we
went through due to such problems and the lessons we
learned.


2 BOWL (Berlin Open Wireless Lab) and 
DevOps Challenges 


The main task of the BOWL project is to satisfy two
somewhat conflicting sets of requirements from two user
groups: Internet users and researchers (who are often
developers). We see the following requirements as DevOps
challenges:


• Researchers demand a configurable network
(development): The testbed is intended for a wide
selection of research topics ranging from enhancing
measurement-based physical layer models for wireless
simulation [8] to routing protocols [11, 12].
Hence, one of the goals of the BOWL project is to
allow multiple researchers to access the network,
deploy experimental services, change configurations
and run new experiments or repeat old experiments
while still ensuring Internet access. Therefore,
the BOWL project required the development
of several tools to automate software and configuration
deployment in the testbed. We discuss our
experience with these tools, and how they evolved,
in Section 4.

• Internet users demand a reliable network (operation):
Changing the network configuration, and deploying
and running experiments, should not affect
the availability of Internet access. This implies that
basic connectivity should not be affected, or only
for a negligible time. It also means that
services such as authentication, DHCP and DNS
need to remain available in any experiment setup.
How the BOWL network architecture addresses this
problem is summarized in Section 3. A major
operational challenge is the authentication of different
types of users (e.g., Internet users from TUB and
T-Labs, and researchers) to the BOWL network, and
we discuss this in detail in Section 5.


3 BOWL Network Architecture


In addition to its outdoor network, the BOWL project is 
in charge of two additional networks: (1) a smoketest 
network, for early development and testing and (2) an 
indoor network, for small-scale deployment and testing. 
These networks are used for development and staging 
before a full-scale deployment and measurements in the 
outdoor network. Therefore, the research usage pattern 
of the outdoor network is more bursty, with periods of 
heavy activity followed by lighter usage, whereas the 
smoketest and the indoor networks have been in heavy 
use since their deployment in early 2008. In this paper, 
we mainly focus on our experience with the outdoor net- 
work. 

The BOWL network architecture was first presented 
in [9]. In this section, we summarize this architecture to 
give the necessary information to understand the BOWL 
environment and its challenges. The outdoor network 
comprises more than 60 nodes deployed on the rooftops
of TUB buildings. It spans three different hardware
architectures (ARM, MIPS and x86). Each node is powered
by Power over Ethernet (PoE), which simplifies cabling
requirements. All nodes are equipped with a hardware
watchdog, multiple IEEE 802.11a/b/g/n radio interfaces
and a wired Ethernet interface. One radio interface
is always dedicated to Internet access; the additional
radio interfaces are free to be used in research
experiments, and the wired interface is used for network
management and Internet connectivity. All nodes are
connected via at least 100 Mbit/s Ethernet to a router
that is managed by the project. A VLAN network ensures
flat layer 2 connectivity from our router to each
node. Our router ensures connectivity to the BOWL
internal network, the TUB network and the Internet. In
its default configuration (which is called the rescue
configuration), the network is set up as a bridged layer 2
infrastructure network. Association to the access
interface and the traffic on it are protected by WPA2
(from the IEEE 802.11i standard [1]). Authentication is
performed with IEEE 802.1X and RADIUS.

Each node runs OpenWrt [5] as the operating system. 
The OpenWrt build system typically produces a mini- 
mally configured image. To tailor this image to each 
node, the image is configured at boot time by an auto- 
configuration system that applies a so-called configura- 
tion to the image. A configuration includes all the con- 
figuration files that go under the /etc/config direc- 
tory (the layout is specific to OpenWrt), and additional 
files, scripts and packages that may be needed by the ex- 
perimenter. The details of the auto-configuration system 
are explained in Section 4. 
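As an illustration of what applying a configuration to an image can involve, the following minimal sketch renders a small set of options in OpenWrt's UCI file syntax and writes one file per package into a target directory. The data layout, the function names and the example values (host name, interface name) are assumptions made for this sketch; the actual BOWL auto-configuration system is not shown in this paper.

```python
from pathlib import Path
import tempfile

def render_uci(sections):
    """Render (section_type, section_name, options) tuples as UCI text."""
    lines = []
    for sec_type, name, options in sections:
        lines.append(f"config {sec_type} '{name}'")
        for key, value in options.items():
            lines.append(f"\toption {key} '{value}'")
        lines.append("")  # blank line between sections
    return "\n".join(lines)

def apply_configuration(configuration, config_dir):
    """Write one UCI file per package into config_dir (e.g. /etc/config)."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)
    for package, sections in configuration.items():
        (config_dir / package).write_text(render_uci(sections))

# Hypothetical per-node configuration; all names and values are illustrative.
node_configuration = {
    "system": [("system", "main", {"hostname": "bowl-node-01"})],
    "network": [("interface", "mgmt", {"proto": "dhcp"})],
}

# Write into a temporary directory instead of a real /etc/config.
with tempfile.TemporaryDirectory() as tmp:
    apply_configuration(node_configuration, tmp)
    print((Path(tmp) / "system").read_text())
```

Keeping the configuration as plain data until the last step makes it easy to store it in a database and to apply the same configuration to images built for different hardware architectures.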

By default, every node runs a default rescue image
and uses the aforementioned rescue configuration.
Researchers install guest images in extra partitions and
use guest configurations. Because of the unique needs
of experiment monitoring and reconfiguration at runtime,
a network management and experiment monitoring
system was developed, which has also changed
significantly since the version presented in [9]. In
essence, it comprises two main components: a
node-controller, which runs on each node, and a central
node-manager. Each node-controller connects to one
node-manager. However, with the recent changes, several
node-managers can now run in parallel, e.g., one
for each experiment if several experiments need to
run in parallel, or one for development. Our typical
operation requires one node-manager per network (e.g.,
smoketest, indoor and outdoor). Thanks to the underlying
VLAN infrastructure and virtualization of the central
router, the traffic generated by each experiment can
be isolated if multiple experiments are running in the
network. More details on this topic can be found in [9].


Unwanted side effects of experiment software
(e.g., crashes, slowing down of network services)
are expected to occur in practice, but their effect needs
to be minimized as much as possible. This is achieved
thanks to the locally installed images. Indeed, a node
that is experiencing problems can be rescued by an
immediate reboot into the rescue image. This mode of
operation is implemented using hardware and
software watchdogs that periodically check that certain
services are operational. For example, the
node-controller at each node periodically checks connectivity
to the central node-manager, and when a disconnection is
detected, the node is rebooted into the rescue image within
60 s. Note that since each node independently triggers a
switch to the rescue mode based on its own hardware
and software watchdogs, nodes do not all go down at the
same time, which limits network disruptions. More details
on how experiment problems are detected can be found
in [9].
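The connectivity check described above can be sketched as a small software watchdog. This is only an illustration under stated assumptions: the class and method names are hypothetical, the 60 s deadline is the only value taken from the text, and the real node-controller logic is described in [9], not here.

```python
import time

RESCUE_DEADLINE_S = 60  # the text states a reboot into rescue within 60 s

class ConnectivityWatchdog:
    """Tracks the last successful contact with the node-manager."""

    def __init__(self, deadline_s=RESCUE_DEADLINE_S, clock=time.monotonic):
        self.deadline_s = deadline_s
        self.clock = clock            # injectable so the logic can be tested
        self.last_ok = clock()

    def record_check(self, manager_reachable):
        """Call after every periodic connectivity check."""
        if manager_reachable:
            self.last_ok = self.clock()

    def must_reboot_to_rescue(self):
        """True once the manager has been unreachable for the whole deadline."""
        return self.clock() - self.last_ok >= self.deadline_s

# Demonstration with a fake clock instead of real waiting.
now = [0.0]
watchdog = ConnectivityWatchdog(clock=lambda: now[0])
for second, reachable in [(10, True), (30, False), (50, False), (70, False)]:
    now[0] = float(second)
    watchdog.record_check(reachable)
    print(second, "reboot to rescue" if watchdog.must_reboot_to_rescue() else "keep running")
```

Because every node evaluates its own timer, a node-manager outage leads to staggered reboots rather than a simultaneous network-wide one, which matches the behaviour described in the text.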

In the remainder of the paper, we focus on how we 
addressed two main challenges: the development chal- 
lenge of supporting multiple network configurations for 
different researchers and the operational challenge of au- 
thentication in the BOWL network. 


4 A Development Challenge: Support- 
ing Multiple Network Configurations for 
Wireless Experimentation 


One of the main goals of the BOWL project is to
allow multiple researchers to create experiments, and to
run and repeat their experiments in a consistent
fashion. In the remainder of this section, we first
summarize the system that we started off with around
mid-2008, and describe how it evolved during the
lifetime of the project. Essentially, the reliability of the
BOWL network was jeopardized by several configuration
glitches, and therefore our decisions to rewrite the
software completely were strongly driven by the need
to maintain network reliability at all times.

Figure 2: An example of how three researchers maintain
their own configuration in the BOWL network.


The node configuration of a given experiment consists
of two parts: (1) an operating system image and
(2) an experiment configuration. OpenWrt manages
the whole configuration of the operating system using
the Unified Configuration Interface (UCI) [6]. We also
take advantage of the UCI. As the network is used for
very different purposes, it becomes necessary to maintain
consistent network configurations across the users.
Therefore, initially, we had a configuration database and
stand-alone scripts to apply these configurations manually
from a central server. As more nodes were deployed
in the BOWL network, it became necessary
to have a more scalable and manageable solution. To
this end, the existing node-manager and node-controller
framework was extended to support node configurations.
The important components for a BOWL user are: (i)
the web-based front-end to a configuration database,
and (ii) a client-server auto-configuration process that
runs in the node-controllers and the node-manager,
respectively. The auto-configuration scheme was added
after mid-2010 due to the several failures that occurred
with the earlier version. Figure 2 illustrates how, for
instance, three researchers maintain their configurations in
the BOWL system.


Using the web-based front-end, a researcher can pick
a configuration, an image and the node partition on which
to deploy the experiment. From this step on, the researcher
flashes the image to this partition, and nodes are
configured by the auto-configuration process at boot time
(or before the image is booted). However, currently, a
researcher still needs to record which image was
used with which experiment configuration. In the future,
we are planning to automate this lab bookkeeping
process. Finally, a reservation system prevents node and
image usage conflicts. Currently, the reservation system
used in BOWL is primitive, in the sense that the entire
network is reserved for a single researcher for a given
period of time. Each researcher is responsible for their
image and configuration and deploys this image to a given
node partition. Hence, merging of multiple images from
different experimenters is not expected.

This framework, complete with a new auto-configuration
scheme, has been in use since mid-2010 by the
BOWL group and by visiting researchers, who also access
our network remotely. We have learned several lessons
since then, which resulted in the current state of the
framework as we use it today. For instance, one issue
resulted from the inheritance of configurations in the
database. It was not obvious to us at the beginning that
researchers would have difficulties discovering the
inheritance hierarchy. But some of our early users applied
changes to the base configuration expecting them
to take effect in the descendant configuration. To avoid
such problems, we now expose the inheritance hierarchy
to the users of our system and visualize it in the
web-based front-end. Finding the right way to do this was
also a challenging task. Furthermore, being too
accommodating was not a good idea, and we ended up limiting
the functionality of the web-based front-end. Earlier,
researchers could push a configuration to a given node
by just pressing a button. However, since installing images
and installing configurations were separate steps,
this sometimes resulted in applying a configuration
to the wrong image. Therefore, we removed this
functionality from the front-end. In fact, this was the main
reason why an auto-configuration scheme was added to
the system. A final lesson learned was not to assume any
network stability during configuration. With our first
auto-configuration implementation, the nodes fetched
their configurations from the node-manager right after
booting. However, if there were any network instabilities
during this time, the watchdog would trigger and
interfere with the auto-configuration. We now avoid this
problem by having nodes fetch their configurations
before booting the image, configure the image, and boot
only if all checks pass. While our development activities
have slowed down as users have become more used to
working with our framework, we are still looking into
simplifying things even further to lower the entry barrier
to using the BOWL network.
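Two of the lessons above, making the inheritance hierarchy visible and booting a guest image only after its configuration has been fetched and checked, can be sketched as follows. The data model (a `parent` link and an `options` dictionary per configuration), the rule that descendants override their ancestors, and all names are assumptions made for this illustration; they are not taken from the BOWL code base.

```python
def resolve(name, configs):
    """Return (inheritance chain, merged options) for one configuration.

    Assumption: options set in a descendant override those of its ancestors.
    """
    chain, seen = [], set()
    while name is not None:
        if name in seen:
            raise ValueError(f"inheritance cycle at {name!r}")
        seen.add(name)
        chain.append(name)
        name = configs[name].get("parent")
    merged = {}
    for ancestor in reversed(chain):      # base first, most specific last
        merged.update(configs[ancestor].get("options", {}))
    return chain, merged

def choose_boot_target(merged, required_keys):
    """Boot the guest image only if a complete configuration is available."""
    if merged is None:                    # fetching the configuration failed
        return "rescue"
    missing = [key for key in required_keys if key not in merged]
    return "rescue" if missing else "guest"

# Hypothetical configuration database with one base and one descendant.
configs = {
    "base": {"parent": None, "options": {"channel": "36", "ssid": "bowl"}},
    "alice-exp1": {"parent": "base", "options": {"channel": "44"}},
}

chain, merged = resolve("alice-exp1", configs)
print("hierarchy:", " <- ".join(chain))   # shown to the user, most specific first
print("effective options:", merged)
print("boot target:", choose_boot_target(merged, ["ssid", "channel"]))
```

Printing the chain next to the effective options is one simple way to show a user which ancestor a value comes from before anything is deployed.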


5 An Operational Challenge: Authentica- 
tion in the BOWL network 


In exchange for the rooftop usage and installation support,
the BOWL project has contractual obligations to
TUB to provide wireless Internet access to staff and
students. Hence, we need to provide the usual authentication
and accounting services that would be expected
from any WiFi access network. To this end, we use the
widely deployed FreeRADIUS software [4], which is a
server implementation of RADIUS [10]. When a user
tries to authenticate to our network, the authenticator
(hostapd) at the WiFi access point communicates with
the RADIUS server. Using challenge-based protocols,
the RADIUS server determines whether the credentials
provided by the user are valid. Based on the result of this
decision process, the access point either allows the user
to join the network or rejects the user.

Figure 3: Logical diagram of the BOWL authentication
infrastructure.

One of the main reasons why authentication in
the BOWL network is a challenging task is the
interconnection with other networks and the need to provide
access to different types of accounts. FreeRADIUS
supports this by allowing access decisions based on local
account databases or on the results of requests proxied
to further upstream services, which may in turn
be other RADIUS implementations or entirely different
services. Currently, the BOWL network needs to provide
access for the following types of accounts (see
Figure 3):


• TUB accounts, as held by students and members
of staff in another RADIUS server administered
by TUB. Access is provided using PEAP with
MSCHAPv2. The BOWL network does not hold
(or ever see in any other way) the passwords
associated with these accounts, because it just proxies the
encrypted challenge and response messages.

• eduroam [3] access is provided by TUB using the
same scheme as described above. Accounting data
for this and the previous scheme are forwarded to
TUB.

• Accounts for the local department FG INET,
administered by the department of which BOWL is a
part. The upstream authentication service is a Kerberos
installation. Access is provided using TTLS
with PAP, because this kind of upstream service
requires that the FreeRADIUS server handles the
passwords of the users.

• Local accounts for demonstration and guest access
purposes, administered by the BOWL network.
Implemented using PEAP with MSCHAPv2. In contrast
to the previous schemes, all schemes available
as default settings in FreeRADIUS provide working
options here. The credentials are held in a local
database.

• Experiment-specific accounts for researchers,
administered by the BOWL network. Implemented in
a vein similar to the local accounts. These special
accounts allow us to filter data about traffic
generated for the purpose of experimentation
out of the accounting database.
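The account types listed above can be summarized as a small routing table keyed by the realm part of the user identity. This is a sketch only: the realm names are hypothetical, the assumption that unknown realms belong to roaming eduroam users is ours, and in the real system these decisions are expressed in the FreeRADIUS configuration rather than in Python.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    account_type: str
    eap_method: str
    backend: str
    forward_accounting: bool = False

# Realm names are hypothetical; methods and back ends follow the list above.
ROUTES = {
    "tub.example": Route("TUB", "PEAP/MSCHAPv2", "proxy to TUB RADIUS", True),
    "inet.example": Route("FG INET", "TTLS/PAP", "Kerberos"),
    "guest.bowl.example": Route("local", "PEAP/MSCHAPv2", "local database"),
    "exp.bowl.example": Route("experiment", "PEAP/MSCHAPv2", "local database"),
}

# Assumption: any other realm is a roaming eduroam user handled via TUB.
EDUROAM = Route("eduroam", "PEAP/MSCHAPv2", "proxy to TUB RADIUS", True)

def route_for(identity):
    """Pick the handling for a user@realm identity."""
    _, separator, realm = identity.rpartition("@")
    if not separator:
        raise ValueError("identity has no realm")
    return ROUTES.get(realm.lower(), EDUROAM)

def include_in_usage_stats(route):
    """Experiment accounts are filtered out of accounting reports."""
    return route.account_type != "experiment"

for identity in ["alice@tub.example", "bob@inet.example",
                 "carol@uni.example", "run42@exp.bowl.example"]:
    route = route_for(identity)
    print(identity, "->", route.account_type, route.eap_method, route.backend)
```

Writing the table down in one place makes it visible how each additional account type adds another back end, and with it another external dependency to support and monitor.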


From this list it follows immediately that user support
requirements vary with the upstream
authentication source. Administration and support
complexity inevitably increases rapidly with each additional
supported scheme. This complexity, which results from the
highly interconnected nature of BOWL, is bound
to increase. For example, there are discussions about
whether some parts of Deutsche Telekom Laboratories are
to be provided access to BOWL using a limited subset of the
accounts held in an Active Directory service. Also, there
are plans to move local accounts into an LDAP
installation for centralized administration.

In the process of creating all these authentication
interconnections, we have learned that, unlike some other
pieces of server software, FreeRADIUS makes it somewhat
difficult to set up a fresh installation with
self-written configuration files, because of the inherent
complexity of the flow of authentication requests within the
server. The developers make a point of telling their users
to start only from the default settings and to make small
incremental changes. Therefore, keeping the configuration
files in a version control system has proven to
be even more valuable than with any other service.
In summary, FreeRADIUS setup and handling can be
daunting and time-consuming for the administrator who
works with it extensively for the first time. However, we
still feel that we have made the right choice. The software
is freely available under the terms of the GPL, it
works without any need for modification on the BOWL
network and it provides an extremely rich feature set.

Monitoring the availability of external authentication
services has now become one of our major challenges,
and it requires working test accounts for those
services. Monitoring software like Nagios provides support
for self-written plug-ins, but not all upstream service
providers are prepared to provide such accounts.
Testing installations are needed, but they are hard to
realize as they require testing configurations on live nodes.
Furthermore, the upstream providers may be required
to accept and serve requests from these testing
installations. Also, the accounting database must obviously
not be polluted by bogus or testing data.
All of this must be done carefully, as FreeRADIUS has
proven to be a piece of software to which configuration
changes need to be made with special care because of
unintended interactions with other configuration
sections.
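A self-written monitoring plug-in of the kind mentioned above mostly has to map the outcome of a test login to one of the standard Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and print a one-line status. The sketch below shows only that mapping. The `probe` callable, the test account name and the 2 s warning threshold are placeholders; how the test login is actually performed against an upstream service is deliberately left out.

```python
# Standard Nagios plug-in exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_auth(probe, username, warn_after_s=2.0):
    """Run one test login and map the outcome to (exit_code, message).

    `probe` is a hypothetical callable returning (accepted, seconds).
    """
    try:
        accepted, seconds = probe(username)
    except Exception as exc:              # the probe itself failed
        return UNKNOWN, f"AUTH UNKNOWN - probe error: {exc}"
    if not accepted:
        return CRITICAL, f"AUTH CRITICAL - {username} rejected"
    if seconds > warn_after_s:
        return WARNING, f"AUTH WARNING - {username} accepted in {seconds:.1f}s"
    return OK, f"AUTH OK - {username} accepted in {seconds:.1f}s"

def fake_probe(username):
    """Stand-in for a real test login against an upstream service."""
    return True, 0.4

code, message = check_auth(fake_probe, "monitor@tub.example")
print(code, message)
# A real plug-in would finish with sys.exit(code) so Nagios sees the status.
```

Running such a check with a dedicated test account per upstream service separates "our server is down" from "the upstream rejects valid users", which is the distinction that matters when the fault lies outside one's own network.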


One important consequence of not being able to
fully test and monitor external authentication services is
loss of usage of the network. This is particularly
frustrating when it is due to problems in external services
that we do not fully control. And loss of control is not just
a hypothetical scenario. During the spring of 2011, no
TUB users were able to authenticate to the BOWL network.
Local testing revealed that the cause did not lie
in the BOWL network installation; requests were passed
on to the upstream server correctly. The fact that all
authentication protocols in use are encrypted and stateless
made further debugging difficult. The hospitalization
of our main technical contact person at TUB, who
was also the only person knowledgeable about the RADIUS
configurations, at exactly this point in time put
another obstacle in the way of resolving this
issue. Eventually, it was found that a server certificate
of one of the upstream servers had expired, leading to
rejection of user authentication attempts. Luckily, the
BOWL network bounced back from this incident, and
we observed a speedy uptake by users again shortly
afterwards. The first power users returned the morning
after the upstream servers were fixed; the number of
distinct users increased continuously, and two weeks later
the number of distinct users per day peaked.
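An expired certificate is one of the few failure causes that can be anticipated well in advance. The following sketch turns a certificate's `notAfter` value into a number of remaining days using `ssl.cert_time_to_seconds` from the Python standard library. It assumes the expiry date has already been obtained in the text format that Python's `ssl` module uses; retrieving the certificate from an upstream RADIUS server is not covered here, and the 30-day warning threshold and the example date are arbitrary.

```python
import ssl
import time

def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter time (negative if expired)."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400.0

def expiry_status(not_after, warn_days=30, now=None):
    """Classify a certificate as 'expired', 'expiring' or 'ok'."""
    days = days_until_expiry(not_after, now)
    if days < 0:
        return "expired"
    return "expiring" if days < warn_days else "ok"

# Hypothetical notAfter value in the format used by ssl.getpeercert().
not_after = "Jun 15 12:00:00 2030 GMT"
print(expiry_status(not_after), round(days_until_expiry(not_after)), "days left")
```

A check like this only helps for certificates one is able to see, so for upstream servers it complements rather than replaces a good working relationship with the upstream operations team.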


The most that an operational team can do in these 
cases is to rely on its own monitoring tools in order to be 
able to find the source of problems as quickly as possi- 
ble; and to build open and positive relationships with up- 
stream operations teams that make communication and 
collaboration as smooth as possible. We also noticed
that solving the problem was delayed by the unavailability
of the only person with the know-how. Based on
this experience, we try to make sure on our side that
knowledge of the BOWL system is shared among multiple
people who can handle issues independently.


6 Current State and Lessons Learned


Managing a testbed that is both live and experimental is
a significant challenge, as one needs to keep both Internet
users and researchers happy. In this paper, we described the
auto-configuration and authentication solutions that we
run to be able to serve both communities.


We learned several lessons during this phase, which 
we summarize as follows: 


1. It is important to have complete and thorough
documentation that details the know-how of the BOWL
project group. Using our system for the first time is
currently not trivial. Therefore, more time needs to
be invested in educating future users and simplifying
operation.

2. Early adopters of the BOWL framework proved
that people always find a way to use an interface
differently than you expect them to. Well-defined
user interfaces with less functionality turned out to
be much more useful than interfaces with more
functionality but specifications unclear to the user.
Therefore, it is better to design simple first, and add extra
functionality only when it is absolutely required by
the users.

3. While building the BOWL framework, we once
more realized how important user-friendly interfaces
are. Users should be shown all the information
necessary to run the system correctly and easily.

4. In a live network, network disruptions will happen.
Therefore, all functionality should be designed
around issues that can arise from network
instability.

5. Our authentication problems showed that the most
important thing is to maintain good contact with
all the parties that can affect operation. More often than
expected, the problem lies outside our own network,
and we need to rely on the problem-solving skills
of the upstream service providers.

6. FreeRADIUS configuration changes should be 
maintained in a version control system. This makes 
it a lot easier to revert to a previously working ver- 
sion. 

7. The complexity of any important component of the
network, such as authentication services, is only
going to increase as the number of interconnections
increases. Being aware of this fact aids in planning
upcoming changes and in integrating them with
existing configuration options.

8. Finally, we learned that it is essential not to create
information bottlenecks in a project team, and there
should always be multiple people who know how to
handle problems independently of others.


Abstract


This paper investigates the main causes that make application migration to the Cloud complicated and error-prone through
two case studies. We first discuss the typical configuration errors in each migration case study based on our error
categorization model, which classifies the configuration errors into seven categories. Then we describe the common installation
errors across both case studies. By analyzing operator errors in our case studies for migrating applications to the cloud, we
present the design of CloudMig, a semi-automated migration validation system with two unique characteristics. First, we
develop a continual query (CQ) based configuration policy checking system, which helps operators weave important
configuration constraints into CQ-based policies and periodically run these policies to monitor configuration changes
and to detect and report possible configuration constraint violations. Second, CloudMig combines the CQ-based policy
checking with template-based installation automation to help operators reduce installation errors and increase the
correctness assurance of application migration. Our experiments show that CloudMig can effectively detect a majority of the
configuration errors in the migration process.

Keywords: System management, Cloud Computing, Application Migration 
Technical area: Cloud Computing 


1 Introduction 


Cloud computing infrastructures, such as Amazon EC2 [3], provide elastic, economical and scalable solutions and
outsourcing opportunities for different types of consumers and end-users. Their pay-as-you-go utility-based computing model
attracts many enterprises to build their information technology services and applications on EC2-like cloud platforms,
and many, such as SmugMug and Twistage, successfully achieve their business objectives. An increasing number of
enterprises embrace Cloud computing by making deployment plans or by engaging in the process of migrating their services
or applications from a local data center to a Cloud computing platform like EC2, because this will greatly reduce their
infrastructure investments, simplify operations, and improve the quality of their information services.

However, the application migration process from the local data center to the Cloud environment turns out to be quite com- 
plicated: error-prone, time-consuming and costly. Even worse, the application may not work correctly after the sophisticated 
migration process. Existing approaches mainly complete this process in an ad-hoc manual manner and thus the chances of 
error are very high. Thus how to migrate the applications to the Cloud platform correctly and effectively poses a critical 
challenge for both the research community and the computing service industry. 

In this paper, we investigate the factors and causes that make the application migration process complicated and error-prone
through two case studies, which migrate the Hadoop distributed system and the RUBiS multi-tier Internet service from a
local data center to Amazon EC2. We first discuss the typical configuration errors in each migration case study based on our
error categorization model, which classifies the configuration errors into seven categories. Then we describe the common
installation errors across both case studies. We illustrate each category of errors with examples selected from
the typical errors observed in our experiments. We also present statistical results on the error distributions in each case
study and across case studies. By analyzing operator errors in our case studies for migrating applications to the cloud, we present
the design of CloudMig, a semi-automated migration validation system that offers effective configuration management to
simplify and facilitate the migration configuration process. The CloudMig system makes two unique contributions. First,
we develop a continual query based configuration policy checking system, which helps operators weave important
configuration constraints into continual query policies and periodically run these policies to monitor configuration changes
and to detect and report possible configuration constraint violations. Second, CloudMig combines the continual query based
policy checking system with the template-based installation automation system, offering effective ways to help operators
reduce installation errors and increase the correctness assurance of application migration. Our experiments show that
CloudMig can effectively detect a majority of the configuration errors in the migration process.

The rest of the paper is organized as follows. In Section 2, we discuss the potential causes that lead to the complicated
and error-prone nature of the migration process. We review the existing approaches and their limitations in Section 3. We
report our operator-based case studies in Section 4 through a series of experiments conducted on migrating distributed system
applications and multi-tier Internet services from a local data center to an Amazon EC2-like cloud, including the common
migration problems observed and the key insights for solving the problems. In Section 5 we present the design of the CloudMig
system, which provides both configuration validation and installation automation to simplify the migration process.


2 Why Migration to Cloud is Complicated and Error-prone 


Several causes make the migration process to the Cloud complicated and error-prone. First, changes in the computing
environment render many environment-dependent configurations invalid. For example, when the database server is
migrated from the local data center to the Cloud, its IP address is likely to change, and this inevitably imposes the requirement
of updating the IP address in all the components that depend on this database server. The migration process incurs a large
number of configuration update operations, and neglecting even a single update may render the whole system
inoperable. Second, the deployment of today’s enterprise systems consists of a large number of different components. For
example, for load balancing purposes, there may be multiple web servers and application servers in the system. Thus the
dependencies among the many components are rather complicated and can be broken very easily in the migration process.
Sorting out the dependencies to restore the normal operational status of the applications may take much more time than the
migration process itself. Third, there are many hidden control settings which may be broken inadvertently in the
migration process. For example, the access controls of different components may be disturbed, which exposes the system to
security threats. Lastly, the human operators in the complicated migration process may make many careless errors which
are very difficult to identify. Overall, the complicated deployments, the massive dependencies, and the lack of automation
make the migration process difficult and error-prone.


3 Related Work 


Most of the existing migration approaches are either done manually or limited to only certain types of applications. For
example, the suggestions recommended by Opencrowd are rather high level and abstract and lack concrete assistance for
the migration problem [4]. The solution provided by OfficetoCloud is limited to Microsoft Office products
and does not even scratch the surface of large application migration [5]. We argue that a systematic and realistic study on the
complexity of migrating large scale applications to the Cloud is essential to direct Cloud migration efforts. Furthermore, an
automatic and operational approach is in high demand to simplify and facilitate the migration process.

Nagaraja et al. [8] proposed a testbed for inserting faults to study error behaviors. In our work, we study operator
errors by migrating real applications from a local data center to EC2. This forms a solid problem analysis context
which motivates an effective solution for the migration problem. Vieira and Madeira [9] proposed to assess the recoverability
of database management systems through fault emulation and recovery procedure emulation. However, they assumed that
human operators had the fault identification capability. In our work, we assume that human operators have only limited error
identification capability and cannot avoid errors.

Thus an automated configuration management system is in high demand. There is already intensive research work on the
design and evaluation of interactive systems with human operators involved in the field of human computer interaction. For
example, Maxion and Reeder [7] studied the genesis of human operator errors and how to reduce them through the user
interface.


4 Migration Operations and Error Model 


In this section, we describe a series of application migration practices conducted in migrating typical applications from a
local data center to the EC2 Cloud platform. We first introduce the experimental setup and then discuss the migration practices
on representative applications in detail, focusing in particular on the most common errors made during the migration
process. Based on our observations, we build the migration error model through the categorization of the migration errors.


4.1 Experiment Setup


Our experimental testbed involves both a local data center and the EC2 Cloud. The local data center, in the College of Computing
at Georgia Institute of Technology, is called “loki” and is a 12-node, 24-core Dell PowerEdge 1850 cluster. Because the
majority of today’s enterprise infrastructures are not virtualized, the physical to virtual (P2V) migration paradigm is the
mainstream for migrating applications to virtualized cloud datacenters. In this work, we focus on P2V migration.

We deliberately selected representative applications as migration subjects. These applications are first deployed in the
local data center, and then operators are instructed to migrate them from the local data center to the Cloud. Our hypothesis is that
application migration to the cloud is a complicated and error-prone process and thus a semi-automated migration validation
system can significantly improve the efficiency and effectiveness of application migration. With the experimental setup across
the local data center and the Amazon EC2 platform, we are able to deploy applications of moderate enterprise scale for migration
from a real local data center to a real Cloud platform and test the hypothesis under real workloads, real large-scale
systems, and a real, powerful Cloud.

We selected two types of applications in the migration case studies: Hadoop and RUBiS. These represent typical types of 
applications used in many enterprise computing systems today. The selection was made mainly by taking into account the 
service type, the architecture design and the migration content. 


• Hadoop [1], as a powerful distributed computing paradigm, has become increasingly attractive to many enterprises, such as
Facebook and Yahoo, for analyzing the large scale data generated daily. Many enterprises utilize Hadoop as a key
component to achieve data intelligence. Because of its distributed nature, the more nodes participate in the computation,
the more computation power is obtained in running Hadoop. Thus, when the computation resources are limited at the
local site, enterprises tend to migrate their data intelligence applications to the Cloud to scale out the computation. In
terms of service functionality, Hadoop is a typical representative of data-intensive computation applications,
and thus the migration study on Hadoop provides good reference value on data-intensive application migration
behaviors.


Hadoop consists of two subsystems, map-reduce computation subsystem and Hadoop Distributed File System (HDFS), 
and thus migrating Hadoop from local data center to the Cloud includes both computation migration and file system 
migration or data migration. Thus it is a good example of composite migration. From the angle of architecture design, 
Hadoop adopts the typical master-slave structure in its two layers of subsystems. Namely, in map-reduce layer, a job 
tracker manages multiple task trackers and in the HDFS layer, a NameNode manages multiple DataNodes. Thus the 
dependency relationships among multiple system components form a typical tree structure. The migration study on 
Hadoop reveals the major difficulties or pitfalls in migrating applications with tree-style dependency relationships. 


In our P2V experiment setup, we deploy a 4-node physical Hadoop cluster, and designate one physical node to work 
as NameNode in HDFS or job tracker in map-reduce and four physical nodes as DataNode in HDFS or task tracker in 
map-reduce (the NameNode or job tracker also hosts a DataNode or task tracker). The Hadoop version we are using is
Hadoop-0.20.2. The migration job is to migrate the source Hadoop cluster to the EC2 platform into a virtual cluster with
4 virtual nodes.


• RUBiS [2] is an emulation of multi-tiered Internet services. We selected RUBiS as a representative case of large scale
enterprise services. To achieve scalability, enterprises often adopt the multi-tiered service architecture. Multiple
servers are used for receiving Web requests, managing business logic, and storing and managing data, forming the Web tier,
application tier, and database tier. Depending on the workload, one can add or reduce the computation capability at a
certain tier by adding more servers or removing some existing servers. Concretely, a typical three tier setup consists of
an Apache HTTP server, a Tomcat application server, and a MYSQL database as the Web tier, application tier, and
database tier respectively.


We selected the RUBiS benchmark for our second migration case study by considering the following factors. First, Internet
service is a very basic and prevalent application type in daily life. E-commerce enterprises such as eBay usually
adopt the multi-tiered architecture emulated by RUBiS to deploy their services, and this renders RUBiS a representative
case of Internet service architecture migration. Second, the dependency relationship among the tiers of multi-tiered
services follows an acyclic graph structure, rather than a rigid tree structure, making it a good alternative in studying
dependency relationship preservation during the migration process. Third, the migration content of this type of
application involves reallocation of application, logic and data, and thus its migration provides a good case study on
rich content migration. In the P2V experiment setup, one machine hosts the Apache HTTPD server as the first
tier, two machines host the Tomcat application server as the second tier, and two machines host the MYSQL
database as the third tier.


In the following subsections, we introduce the two migration case studies we have conducted, Hadoop migration and RUBiS
migration, focusing mainly on configuration errors and installation errors. The configuration errors are our primary focus
because they are the most frequent operator errors, some of which are also difficult to identify and correct. Installation errors
can be corrected or eliminated by more organized installation steps or semi-automated installation tools with more detailed
installation scripts and instructions.

We first discuss the typical configuration errors in each migration case study based on our error categorization model, 
which classifies the configuration errors into seven categories: dependency preservation error, network connectivity error, 
platform difference error, reliability error, shutdown and restart error, software and hardware compatibility error, and access 
control and security error. Then we describe the common installation errors across both case studies. We illustrate each
category of errors with examples selected from the typical errors observed in our experiments. Finally, we
present the statistical results on the error distributions in each case study and across case studies. This experimental analytic
study of major errors lays a solid foundation for the design of a semi-automated migration validation system that offers 
effective configuration management. 


4.2 Hadoop Migration Study 


In the Hadoop migration case study, we migrate the source Hadoop application from the local data center to EC2 platform. 
This section discusses the typical configuration errors observed in this process. 

Dependency Preservation: This is the most common error in our experiments. Such an error is very easy to
make, very difficult to discover, and may lead to disastrous results. According to the severity of its impact
on the deployment and migration, it can be further classified into four levels.

The first level is the “dependency preservation” error generated when the migration administrator fails to perform
the necessary dependency preservation checking. Even if the dependency information is presented explicitly, a lack of
enforcement in reviewing the component dependencies may lead to stale dependency information. For example, in our experiments,
if the migration operator forgets to update the dependency information among the nodes in the Hadoop application, then the
DataNodes (or task trackers) after migration will still initiate connections with the old NameNode (or job tracker). This
directly renders the system inoperable.

The second level of errors in Hadoop migration is due to incorrect formatting and typos in the dependency files. For
example, a typo hidden in the host name or IP address leaves some DataNodes unable to locate the NameNode.

The third level of the dependency preservation error type is due to incomplete updates of dependency constraints. For
example, one operator only updated the configuration files named “masters” and “slaves”, which record the NameNode and
the list of DataNodes respectively. However, Hadoop dependency information is also located in other configuration settings,
such as “fs.default.name” in “core-site.xml” and “mapred.job.tracker” in “mapred-site.xml”. Thus Hadoop was still not able to
boot with the new NameNode. This is a typical pitfall in migration, and it is also difficult for the operator to detect, because the
operator may think that the whole dependency is updated and may spend intense effort locating faults in other places.

The fourth level of the dependency preservation error type is due to inconsistency in the number of machines updated
in the system. Often, an insufficient number of updated machines may lead to unexpected errors that are hard for
operators to debug. For example, although the operator realizes the necessity of updating the dependency constraints and also
identifies all the locations of constraints on a single node, the operator may fail to update all the machines that are
involved in the system-wide dependency constraints. In Hadoop migration, if not all the DataNodes update
their dependency constraints, the system cannot run with the participation of all the nodes.
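
The third and fourth levels above both come down to stale references that survive in some file on some node. As an illustration only, and not a description of the tooling used in the experiments, the following Python sketch scans a set of configuration file contents for pre-migration host names. The function name `find_stale_hosts`, the host names (`loki01`, `ec2-master`, `ec2-slave1`), and the port numbers are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Tuple


def find_stale_hosts(files: Dict[str, str],
                     old_hosts: Iterable[str]) -> List[Tuple[str, int, str]]:
    """Return (file name, line number, host) for each reference to a pre-migration host."""
    hits = []
    for host in old_hosts:
        # Match the host only as a whole token, so 10.0.0.1 does not match 10.0.0.10.
        pattern = re.compile(r"(?<![\w.-])" + re.escape(host) + r"(?![\w.-])")
        for name, text in files.items():
            for lineno, line in enumerate(text.splitlines(), start=1):
                if pattern.search(line):
                    hits.append((name, lineno, host))
    return sorted(hits)


# Hypothetical snapshot of one node after an incomplete update:
# "masters" and "slaves" were edited, but the two XML files were not.
node_files = {
    "masters": "ec2-master\n",
    "slaves": "ec2-master\nec2-slave1\n",
    "core-site.xml": "<property><name>fs.default.name</name>\n"
                     "<value>hdfs://loki01:9000</value></property>\n",
    "mapred-site.xml": "<property><name>mapred.job.tracker</name>\n"
                       "<value>loki01:9001</value></property>\n",
}

stale = find_stale_hosts(node_files, ["loki01"])
print(stale)
```

Running the same scan against the files collected from every node, rather than from one node only, would address the fourth level as well.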

Network Connectivity: Because of its distributed computing nature, Hadoop involves intensive communication across
nodes: the NameNode communicates continuously with the DataNodes, and the job tracker communicates continuously with the
task trackers. Thus, for such a system to work correctly, connectivity among nodes is an indispensable prerequisite.
In our experiments, operators showed two types of network connectivity configuration errors after migrating
Hadoop from the local data center to EC2 in the P2V migration paradigm.

The first type of such error is that some operators did not configure the network so that all the machines could reach
each other. For example, some operators forgot to update the file “/etc/hosts”, which led to host name resolution
problems. The second type of such error is local DNS resolution error. For example, some operators did not set the local
DNS resolution correctly, with the consequence that only the DataNodes residing in the same host as the master node
were booted after the migration.

Platform Difference: The platform difference between the EC2 Cloud and the local data center also creates some errors in
migrating applications. These errors can be classified into three levels: security, communication, and incorrect instance
operation. In our experiment, when the applications are hosted in the local data center, the machines are protected by
firewalls, and thus even if the operators set simple passwords, the security is complemented by the firewalls. However, when
the applications are migrated into the public Cloud, the machines can experience all kinds of attacks, and thus overly simple
passwords may render the virtual hosts susceptible to security threats. The second level of the platform difference error
type is related to the difference in communication settings between the cloud and the local data center. For example, such an error may
occur after the applications are migrated into the EC2 Cloud, if the communication between two virtual instances is still set up in
the same way as if the applications were hosted in the local data center. Concretely, for the operator in one virtual instance
to ssh to another virtual instance, the identity file granted by Amazon must be provided. Without the identity file,
communication between virtual instances cannot be set up correctly. The third level of the platform difference error type is
rooted in the difference between virtual instance management infrastructures. In the experiments, there were operators who
terminated an instance when their actual intention was to stop it. On the EC2 platform, terminating an instance
eliminates the virtual instance from the Cloud, and thus all the applications installed and all the data stored within the
virtual instance are lost if the data is not backed up in persistent storage such as Amazon Elastic Block Store. This poses
critical risks on instance operations, because a wrong instance operation may wipe out all the applications and data.

Reliability Error: In order to achieve fault tolerance and performance improvements, many enterprise applications like
Hadoop and multi-tiered Internet services replicate their data or components. For example, in Hadoop, data is replicated on
a certain number of DataNodes, while in multi-tiered Internet services, there may exist multiple application servers or database
servers. Thus after the migration, if the replication degree is not set correctly, either the migrated application fails to work
correctly or the fault tolerance level is compromised. For example, in the experiments, there were cases in which the operator
set the replication degree higher than the total number of DataNodes in the system. Reliability errors are
sometimes latent errors.
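
A replication degree that exceeds the number of DataNodes can be caught before the cluster is started. The sketch below is one possible check, written under the assumption that the replication degree is read from the HDFS property `dfs.replication` in a Hadoop-style XML file; the text above does not name the setting, and the helper names `read_property` and `replication_problem` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def read_property(xml_text: str, name: str) -> Optional[str]:
    """Return the value of a Hadoop-style <property> entry, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


def replication_problem(replication: int, datanodes: int) -> Optional[str]:
    """Describe a replication misconfiguration, or return None if the value is usable."""
    if replication < 1:
        return "replication degree must be at least 1"
    if replication > datanodes:
        return (f"replication degree {replication} exceeds "
                f"the {datanodes} available DataNodes")
    return None


# Hypothetical configuration fragment with a replication degree that is too high
# for a 4-node cluster.
hdfs_site = """<configuration>
  <property><name>dfs.replication</name><value>5</value></property>
</configuration>"""

value = int(read_property(hdfs_site, "dfs.replication"))
print(replication_problem(value, datanodes=4))
```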

Shutdown and Restart: This type of error means that the shutdown or restart operation in the migration process may
cause errors if not performed correctly. For example, a common data consistency error may occur if HDFS is
shut down incorrectly. More seriously, a shutdown or restart error may sometimes compromise the source system. In our
experiment, when the dependency graph was not updated consistently and the source cluster was not shut down completely,
the destination Hadoop cluster connected to the source cluster and acted as its client.
As a result, all the operations issued by the destination cluster actually manipulated the data in the source cluster, and thus the
source cluster data was contaminated. Such errors may have disastrous impacts on the source cluster and are dangerous if
the configuration errors are not detected in time.

Software and Hardware Compatibility: This type of error is less common in Hadoop migration than in RUBiS migration,
partly because Hadoop is built on top of Java and thus has better interoperability, and partly because Hadoop involves a relatively
smaller number of different components than RUBiS. Sometimes, the difference in software versions may lead to errors. For instance,
the initial Hadoop version selected by one operator was Hadoop 0.19, which showed bugs on the physical machine. After the
operator switched to the latest 0.20.2 version, the issue disappeared.

Access Control and Security: A single node Hadoop cluster can be set up and migrated without root access.
However, because a multi-node Hadoop cluster needs to change the network inter-connectivity and solve the local DNS
resolution issue, the root access privilege is necessary. One operator assumed that the root privilege was not necessary for
multi-node Hadoop installation, was blocked by the network connectivity problem for about one hour, and then sought
help to obtain the root privilege.


4.3 RUBiS Migration Study 


In the RUBiS migration experiments, we migrate a RUBiS system with one web server, two application servers, and
two database servers from the local data center to the EC2 Cloud. Below we discuss the configuration errors present in the
experiments in terms of the seven error categories.

Dependency Preservation: Similar to Hadoop migration, the dependency preservation error type is also the most
common error in RUBiS migration. Because RUBiS has more intensive dependencies among different components than Hadoop,
operators made more configuration errors in the migration. For different tiers of a RUBiS system to run cooperatively,
dependency constraints need to be specified explicitly in the relevant configuration locations. For example, for each Tomcat server,
its relevant information needs to be recorded in the configuration file named “workers.properties” in the Apache HTTPD server.
The MYSQL database server needs to be recorded in the RUBiS configuration file named “mysql.properties”. Thus an error
in any of these dependency configuration files will lead to an operation error. In our experiments, operators made different
kinds of dependency errors. For example, some operators migrated the application but forgot to update the Tomcat server
name in workers.properties. As a consequence, although the Apache HTTPD server was running correctly, RUBiS was not
operating correctly because the Tomcat server could not be reached. One operator could not find the location of the
configuration file for updating the MYSQL database server information in RUBiS, which resides in the same host as Tomcat;
this led to errors, and the operator therefore gave up the installation.

Network Connectivity: Relative to Hadoop migration, there is less node interoperation in a multi-tiered system like
RUBiS, and the different tiers place fewer demands on network connectivity; thus network connectivity configuration errors are
less frequently seen in RUBiS migration. One typical error was seen when an operator connecting to a Cloud virtual
instance forgot to provide the identity file that enables two virtual instances to connect via ssh.

Platform Difference: This error type turns out to be a serious fundamental concern in RUBiS migration. Because
an instance reboot may sometimes change the domain name, public IP, and internal IP, a reboot may cause a service
interruption even if the multi-tiered service has been migrated successfully. One operator
finished the migration and, after fixing a few configuration errors, had the application working correctly in EC2. After we
turned off the system on EC2 for one day and then rebooted the service, we found that because the domain name had completely
changed, all of the IP address and host name information in the configuration files needed to be updated.

Reliability Error: Due to the widely used replication in enterprise systems, it is typical for a system to have
more than one application server and/or more than one database server. One operator misspelled the name of the second
Tomcat server, but because there remained a working Tomcat server due to replication, the service continued without
interruption. However, such an error remained hidden inside the system, and it may cause unexpected failures that could
lead to serious damage and yet are hard to debug and correct. This further validates our argument that configuration error
detection and correction tools are critical for cloud migration validation.
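
Because replication masks this kind of error at the service level, each replica entry has to be validated individually. The following Python sketch is an illustration under stated assumptions rather than the tool used in the study: it assumes the usual mod_jk `worker.<name>.host=<host>` line format in workers.properties, and the function names, worker names, and host names are hypothetical.

```python
import re
from typing import Dict, Set

# Matches lines such as "worker.tomcat1.host=tomcat1.cloud.example".
HOST_LINE = re.compile(r"^\s*worker\.([^.\s]+)\.host\s*=\s*(\S+)\s*$")


def worker_hosts(properties_text: str) -> Dict[str, str]:
    """Map each worker name to the host it is configured to contact."""
    hosts = {}
    for line in properties_text.splitlines():
        match = HOST_LINE.match(line)
        if match:
            hosts[match.group(1)] = match.group(2)
    return hosts


def unknown_workers(properties_text: str, known_hosts: Set[str]) -> Dict[str, str]:
    """Return the workers whose configured host is not one of the known servers."""
    return {worker: host
            for worker, host in worker_hosts(properties_text).items()
            if host not in known_hosts}


# Hypothetical file content: the host of the second Tomcat server is misspelled.
workers_properties = """\
worker.list=tomcat1,tomcat2
worker.tomcat1.type=ajp13
worker.tomcat1.host=tomcat1.cloud.example
worker.tomcat1.port=8009
worker.tomcat2.type=ajp13
worker.tomcat2.host=tomact2.cloud.example
worker.tomcat2.port=8009
"""

known = {"tomcat1.cloud.example", "tomcat2.cloud.example"}
print(unknown_workers(workers_properties, known))
```

The check flags the second worker even though the first one would keep the service running, which is exactly the latent case described above.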

Shutdown and Restart: This type of error shows that an incorrect server start or shutdown operation in multi-tiered services
may render the whole service unavailable. For example, the Ubuntu virtual instance selected for the MYSQL tier has a
different version of the MYSQL database installed by default. One operator forgot to shut down and remove the default installation
before installing the new version of MYSQL and thus caused errors. The operator spent about half an hour finding and
fixing the issues. We also observed a couple of incidents where the operator forgot to boot the Tomcat server
before the shutdown operation, thus causing errors that are time consuming to debug.

Software and Hardware Compatibility: This type of error also happens frequently in RUBiS migration. The physical
machine is 64-bit, while one operator selected the 32-bit version of mod_jk (the component used to forward HTTP
requests from the Apache HTTPD server to the Tomcat server), and thus incompatibility issues occurred. The operator was stuck for
about two hours before finally identifying the version error. After the software was changed to the 64-bit version, the operator
successfully fixed the error. A similar error was observed where an operator selected an arbitrary MYSQL version,
spent about one hour on the failed installation, and then switched to a newer version before finally installing the
MYSQL database server successfully.

Access Control and Security: This type of error also occurs frequently in RUBiS migration. For example, a virtual
instance in the EC2 Cloud has all ports closed by default. To make SSH possible, the security group
in which the virtual instance resides must open the corresponding port 22. One operator configured the Apache HTTPD
server successfully, but the Web server could not be reached through port 80, and it took about 30 minutes to identify the
restrictions from the EC2 documentation. Similar errors also happened for port 8080, which is used for accessing the Tomcat server.
Another interesting error is that one operator set up the Apache HTTPD server but forgot to make the root directory
accessible, and thus index.html was not accessible. The operator reinstalled the HTTPD server but still did not discover
the error. With the help of our configuration assistant, this operator finally identified the error, changed the access
permission, and fixed the error. We also found that operators made errors in granting privileges to different users, and one
case was solved by seeking help in the MYSQL documentation.
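
Closed ports of this kind can be detected from outside the instance with a plain TCP connection attempt. The following is a minimal sketch using only the Python standard library; the function names and the host name in the usage comment are hypothetical, and a closed result may also mean that the service itself is not running rather than that the security group blocks the port.

```python
import socket
from typing import Iterable, List


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def closed_ports(host: str, ports: Iterable[int], timeout: float = 2.0) -> List[int]:
    """Return the ports from `ports` that could not be reached on `host`."""
    return [port for port in ports if not port_open(host, port, timeout)]


# Example call for the ports mentioned above (host name is a placeholder):
# closed_ports("web.cloud.example", [22, 80, 8080])
```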


4.4 Installation Errors 


In our experiment-based case studies, we observe that operators may make all kinds of errors in installation or
redeployment of the applications in the Cloud. More importantly, these errors seem to be common across all types of
applications. In this section we classify these errors into the following categories.

Context information error: This is a very common installation error type. A typical example is that operators forget
the context information they used in a past installation. For example, operators misremembered the path in which to
install their applications and had to reinstall the applications from
scratch. Also, if there are no automatic installation scripts, or an incorrect or incomplete installation script is used,
installation can be a very frustrating experience with the same procedures repeated again and again. If the scale of the
computing system is large, then the repeated installation process turns out to be a heavy burden for system operators.
Thus a template based installation approach is highly recommended.

Environment compatibility error: In this migration case study, before any application can be installed, the compatibility
of the computing environment needs to be ensured at both the hardware and software level. For example, there were migration
failures caused by the small available disk space in the virtual instance when migrating RUBiS. A similar error is that an
operator created a virtual instance with a 32-bit operating system, while the application was a 64-bit version. Thus, it
is necessary to check the environment compatibility before the application installation starts. An automatic environment
checking process helps to reduce the errors caused by incorrect environment settings.
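
As an illustration of what an automatic environment check might cover, the following Python sketch tests the two conditions mentioned above, word size and free disk space. It is a sketch under assumptions, not the checker used in the study: the word size of the running Python interpreter is used as a proxy for the operating system, and the function names and thresholds are hypothetical.

```python
import shutil
import struct
from typing import List


def pointer_bits() -> int:
    """Word size of the running Python interpreter (a proxy for the OS word size)."""
    return struct.calcsize("P") * 8


def environment_problems(required_bits: int,
                         required_free_bytes: int,
                         path: str = ".") -> List[str]:
    """List the reasons why the environment cannot host the application, if any."""
    problems = []
    bits = pointer_bits()
    if bits < required_bits:
        problems.append(f"{required_bits}-bit application on a {bits}-bit environment")
    free = shutil.disk_usage(path).free
    if free < required_free_bytes:
        problems.append(f"only {free} bytes free at {path!r}, "
                        f"{required_free_bytes} required")
    return problems


# Hypothetical requirement: a 64-bit build that needs 2 GiB of free disk space.
print(environment_problems(required_bits=64, required_free_bytes=2 * 1024**3))
```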

Prerequisite resource checking error: This type of error originates from the fact that every application depends on
a certain set of prerequisite facilities. For example, the installations of Hadoop and the Tomcat server presume the installation of
Java. In the experiments, we observed that the migration or installation process was prone to interruption when operators
neglected to install prerequisite standard facilities. For example, the compilation process had to be restarted due to the lack of
a “gcc” installation in the system. Thus, a complete checklist of the prerequisite resources or facilities can help reduce the
interruptions of the migration process.
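
Such a checklist can be turned into an executable check. The sketch below, which assumes that each prerequisite is a command expected on the PATH, uses Python's standard `shutil.which`; the function name `missing_tools` is hypothetical, and the list of tools simply reuses the two examples named above.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str]) -> List[str]:
    """Return the commands from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Prerequisites named in the text: Java for Hadoop and Tomcat, gcc for compilation.
# The result depends on the machine on which the check is run.
print(missing_tools(["java", "gcc"]))
```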

Application installation error: This is the most common error type experienced by the operators. The concrete
application installation process usually consists of multiple procedures. We found that operators made many repeated
errors even when the installation process for the same application was almost the same. For example, operators forgot the
build location of the applications. Thus a template based application installation process will help facilitate the installation
process.


4.5 Migration Error Distribution Analysis 


In this section, we analyze the error distributions for each specific application and the error distribution across the appli- 
cations. 

Figure 1 and Figure 2 show the number of errors and the percentage of error distribution in the Hadoop migration case
study. In both figures, the X-axis indicates the error types as we analyzed in the previous sections. The Y-axis in Figure 1
shows the number of errors of each type, and the Y-axis in Figure 2 shows the share of each error type in terms of
percentage. [...] Network connectivity error
and platform difference error were the next most frequent error types, each taking 17% of the total errors. Network
connectivity errors included local DNS resolution and IP address update errors. One typical platform difference error was that the
termination of an instance led to data loss. It is interesting to note that these three types of errors take 76% of the total errors
and are the dominating types of the error occurrences observed in the experiments we conducted.

Figure 4 and Figure 3 show the number of error occurrences and the percentage of error distribution for the RUBiS migration
case study, respectively. There were a total of 26 error occurrences observed in this process, and some errors fall into several
error categories. The dependency preservation errors and the access control and security errors were the two most frequent error
types, each with 8 occurrences, taking 31% of the total errors. Together, these two error types covered 62% of all the errors and
dominated the error occurrences. It is interesting to note that the distribution of errors in the RUBiS migration case study was
very different from the distribution in the Hadoop migration case study. For example, the number of access control and security errors
in RUBiS was 4 times the number of errors of this type in Hadoop migration. This is because RUBiS migration demanded
correct access control settings for many more entities than Hadoop. Not surprisingly, the majority of the access control errors
were file access permission errors. This is because changing the file access permission is a common operation in setting up
web services, and sometimes operators forgot to validate whether the access permissions were set correctly. Also, when
there were errors and the system could not run correctly, the operators often ignored the possibility of this type of simple error,
which led to a longer time spent on error identification. For example, one error of this type took more than 1 hour to identify.
There were also more ports to open in RUBiS migration than in Hadoop migration, which contributed to the high frequency
of access control errors in RUBiS migration. RUBiS migration presented more software and hardware compatibility errors
than Hadoop migration because the number of different components involved in the RUBiS application is, relatively
speaking, much larger than in the typical Hadoop migration. Similarly, there were more “shutdown/restart” errors in the
RUBiS migration. On the other hand, Hadoop migration presented more network connectivity errors and platform difference
errors than RUBiS migration, because Hadoop nodes require more tightly coupled connectivity than the nodes in RUBiS. For
example, the master node needs to have direct access without password control to all of its slave nodes.

Figure 5 and Figure 6 summarize across the Hadoop migration and RUBiS migration case studies by showing the number of error
occurrences and the percentage of error distribution, respectively. The dependency preservation errors are the most frequent
error occurrences and accounted for 36% of the total errors. In practice,
number of error occurrences. It was very easy for operators to change the file permissions to incorrect settings, and some
other habits that were acceptable in a local data center might render the application susceptible to security threats in the
Cloud environment. The operational or environmental differences between Cloud and local data centers formed the third largest
source of error, accounting for 12% of all the errors. Many common operations in a local data center might lead to errors in
the Cloud if no adjustments to the Cloud environment were made. These three types of errors dominated the error distribution
and cumulatively accounted for 68% of the total errors. In addition to these three types of errors, network connectivity was
also an important source of errors, accounting for 10% of the total errors, because of the heavy inter-node operations in many
enterprise applications today. The rest of errors accounted for 32% of the total errors. These error distributions provide a
good reference model for us to build a solid testbed to test the design of our CloudMig migration validation approach, which
is presented in the subsequent sections of this paper. We argue that a cloud migration validation system should be equipped
with an effective configuration management component that not only provides a mechanism to reduce configuration errors, but
also equips the system with active configuration error detection and debugging as well as semi-automated error correction
and repairs.


5 Migration Validation with CloudMig 
cess across different data centers by utilizing template-based installation procedures to simplify the migration process and
utilizing policy-based configuration management to capture and enforce configuration related dependency constraints and 
improve migration assurance. 

The first prototype of the CloudMig configuration management and validation system consists of four main components: the
centralized configuration management engine, the client-based local configuration management engine, the configuration
template management tool and the configuration policy management tool. The template model and the configuration policy
model form the core of CloudMig as a semi-automated installation and configuration validation system. In the subsequent
sections we briefly describe the functionality of each of these four components.


5.1 Configuration Template Model


CloudMig uses templates as an effective mechanism to simplify the installation process. A template is a pre-formatted,
script-based example file containing placeholders for dynamic and application-specific information, which are replaced with
concrete values at application migration time.
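
The placeholder idea can be illustrated with a short sketch. This is not CloudMig's implementation: the template text, the
placeholder names (package, install_path, java_home, node_name) and the node names are hypothetical, and Python's standard
string.Template is used only as a stand-in for whatever substitution mechanism the prototype employs.

```python
from string import Template

# Hypothetical installation template; every placeholder name is illustrative.
INSTALL_TEMPLATE = Template(
    "tar -xzf ${package} -C ${install_path}\n"
    "export JAVA_HOME=${java_home}\n"
    "echo ${node_name} >> ${install_path}/conf/slaves\n"
)


def render_install_script(context: dict) -> str:
    """Replace the placeholders with concrete values.

    Template.substitute raises KeyError when a placeholder has no value,
    so an incomplete set of values fails before any script is run.
    """
    return INSTALL_TEMPLATE.substitute(context)


# Values shared by all nodes (assumed for illustration).
base_context = {
    "package": "hadoop.tar.gz",
    "install_path": "/opt/hadoop",
    "java_home": "/usr/lib/jvm/default",
}

# One template, many nodes: only the node-specific value changes.
scripts = {
    name: render_install_script({**base_context, "node_name": name})
    for name in ("datanode1", "datanode2", "datanode3")
}
```

Failing on a missing value, rather than leaving a placeholder in the generated script, is a design choice of this sketch; it
keeps an incomplete substitution from reaching the installation step.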

In CloudMig, installation and configuration management operates at the granularity of an application. That is, each
application corresponds to a template set and a validation policy set. The central management server is responsible for
managing the collection of templates and configurations on a per application basis and provides migration planning for the
migration process.

Recall from the observations obtained in our migration experiments in Section 4 that one big obstacle and source of errors
in application migration is the installation and configuration process, which is also a recurring process in system deployment
and application migration. We propose to use the template approach to reduce the complexity of the installation process
and the chance of errors. An installation template is defined by an installation script with placeholders for dynamic
and application-specific information. Templates simplify the recurring installation practice of particular applications by
substituting the dynamic information with new values. For example, an enterprise system with 100 nodes will host
multiple applications, ranging from MySQL database nodes and Tomcat application server nodes to Hadoop distributed
system nodes and so forth. Distributed applications may span and extend to more nodes on demand to scale out. For each
application, its installation templates are stored in the installation template repository. These templates are sorted by the
application type and an application identifier. The intuition behind templates is that, through information abstraction, a
template can be used and refined for many similar nodes through parameter substitution, which simplifies the installation
process for large scale systems. For example, if a Hadoop system consists of 100 DataNodes, then only a single installation
template is stored in the installation template repository, and each DataNode receives the same copy of the installation
template; only parameter substitution is needed before running the installation scripts to set up the DataNode component on
each individual node. The configuration dependency constraints are defined in the policy repository, which is described in the
next subsection. CloudMig classifies the templates into the following four types:


1. Context dictionary: This is the template specifying the context information about the application installation. For 
example, the installation path, the preassumed Java package version, etc. A context dictionary template can be as simple 
as a collection of the key-value pairs. Users specify the concrete values before a particular application installation. 
Dynamic place holders for certain key context information achieve the installation flexibility and increase the ability 
to find out the relevant installation information in the presence of system failures. 


2. Standard facility checklist template: This is the script template to check the prerequisites to install the application.
Usually these are some standard facilities, such as Java or OpenSSH. Typical checklists include those for verifying
the Java path setting, checking installation package existence, and so on. These checklists are common to many
applications and are prerequisites for installing the applications successfully. Performing a template check
before the actual installation can therefore effectively reduce the errors caused by overlooked checklist items. For example,
both Hadoop and the Tomcat server rely on the correct Java path setting, and thus the correct setting of the Java path is a
prerequisite of successfully installing these two applications. In CloudMig, we collect and maintain such templates
in a template library, which is shared by multiple applications. Running the checklist validation check can effectively
speed up the installation process by reducing the number of errors caused by carelessness about prerequisites.


3. Local resource checklist template: This is the script template to check the hidden conditions for an application to be 
installed. A typical example is to perform the check of whether or not there is enough available disk space quota for a 
given application. Similarly, such resource checklist templates are also organized by application type and application 
identifier in the template library and utilized by the configuration management client to reduce the local installation 
errors and installation delay. 


4. Application installation template: This is the script template used to install a particular application. The context 
dictionary is included as a component of the template. Organizing installation templates simplifies the installation 
process and thus reduces the overhead in recurring installations and migration deployments. 
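
The two checklist template types above can be sketched as small checks driven by a context dictionary. This is an illustrative
sketch only: the function names and the context keys (install_path, required_bytes, required_commands) are assumptions, and the
checks rely on Python's standard shutil module rather than on CloudMig's script templates.

```python
import shutil


def check_disk_space(path, required_bytes):
    """Local resource check: is there enough free space on the filesystem holding `path`?"""
    return shutil.disk_usage(path).free >= required_bytes


def check_standard_facilities(commands):
    """Standard facility check: return the required commands that are not found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


def run_checklists(context):
    """Collect human-readable problems; an empty list means every check passed."""
    problems = []
    if not check_disk_space(context["install_path"], context["required_bytes"]):
        problems.append(f"not enough free disk space under {context['install_path']}")
    for cmd in check_standard_facilities(context["required_commands"]):
        problems.append(f"missing standard facility: {cmd}")
    return problems
```

A call such as run_checklists({"install_path": ".", "required_bytes": 10**9, "required_commands": ["java", "ssh"]}) would report
whichever prerequisites are missing on the local node before any installation script is run.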


5.2 Configuration Policy Model 


In this section, we first introduce the basic concept of configuration policy, which plays the key role in capturing and 
specifying configuration dependency constraints and monitoring and detecting configuration anomalies in large enterprise 
application migration. Then we introduce the concept of continual query (CQ) and the design of a CQ enabled configuration 
policy enforcement model. 


5.2.1 Modeling Dependency Constraints with Configuration Policies 


A configuration policy defines an application-specific configuration dependency constraint. Here is an example of such a
constraint for RUBiS: for each Tomcat server, its relevant information needs to be specified explicitly in the configuration
file named “workers.properties” in the Apache HTTPD server. Configuration files are usually application-specific. They
specify the settings of the system parameters and the dependencies among the system components, and thus directly affect
how the system runs. As enterprise applications scale out, the number of components may increase rapidly and
the correlations among the system components evolve with added complexity. In terms of complexity, configuration files for a
large system may cover many aspects of the system configuration, ranging from host system information, to network settings,
to security protocols and so on. Any typo or error may disable the operational behavior of the whole application system,
as we showed and analyzed in the previous experiments. Configuration setting and management is usually a long-term
practice, starting when the application is set up and lasting until the application is retired. During
this long application life cycle, different operators may be involved in the configuration management practices and operate
on the configuration settings based on their own understanding, which further increases the probability of errors in
configuration management. In order to fully utilize resources, enterprises may bundle multiple applications to run on top of a
single physical node, and the addition of new applications may require changes to the configurations of existing applications.
Security threats, such as viruses, also demand effective configuration monitoring and management.

In CloudMig, we propose to use policy as an effective means of ensuring the constraints of configurations to be captured
correctly and enforced consistently. A policy can be viewed as a specialized tool to spec-
tain types of errors, even though errors are unavoidable. Here are a few configuration policy examples that operators may
have in migrating a Hadoop system.





Figure 7. CloudMig Architecture Design 


1. The replication degree cannot be larger than the number of DataNodes.
2. There is only one master node.
3. The master node of the Hadoop cluster should be named “dummy1”.
4. The task tracker node should be named “dummy2”.


As the system evolves and the configuration repository grows, performing such checks manually becomes a heavy
and error-prone process. For example, in enterprise Internet service systems, there may be hundreds of nodes, and the
configuration of each node needs to follow certain constraints. For load balancing purposes, different Apache HTTPD servers
correspond to different sets of Tomcat servers. Incorrect settings of the relevant configuration entries will directly lead to
an unbalanced system and may even cause the system to crash when a workload burst occurs. With thousands of configuration
entries, hundreds of nodes, and many applications, it is impractical, if not impossible, to perform manual configuration
correctness checking and error correction. We argue that a semi-automated configuration constraint checking framework can
greatly simplify the migration configuration and validation management of large scale enterprise systems. In CloudMig, we
advocate the use of continual query as the basic mechanism for automating the configuration validation process of
operator-defined configuration policies. In the next section we describe how a CQ-enabled configuration policy management
engine can improve the error detection and debugging efficiency, thus reducing the complexity of migrating applications from a
local data center to the Cloud.


5.2.2 Continual Query Based Policy Model


In CloudMig, we propose a continual query based policy specification and enforcement model. A continual query (CQ) [6]
is defined as a triple of the form (Query, Trigger, Stop). A CQ can be seen as a standing query: the trigger component
specifies the monitoring condition and is evaluated periodically once the CQ is installed, and whenever the trigger condition
is true, the query component is executed. The Stop component defines the condition that terminates the execution of the CQ.
A trigger condition can be either time-based or content-based, such as “check the free disk space every hour” or “trigger a
configuration action when the free disk space is less than 1GB”.
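
As a rough illustration of the (Query, Trigger, Stop) triple, the sketch below evaluates a CQ over a sequence of state
snapshots, one per evaluation period. The class, the snapshot fields (t, free_bytes) and the stop condition are assumptions
made for the example; they are not taken from CloudMig or from [6].

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ContinualQuery:
    query: Callable[[dict], object]   # executed whenever the trigger fires
    trigger: Callable[[dict], bool]   # monitoring condition
    stop: Callable[[dict], bool]      # termination condition


def run_cq(cq: ContinualQuery, snapshots: List[dict]) -> list:
    """Evaluate the CQ once per snapshot until the stop condition holds."""
    results = []
    for state in snapshots:
        if cq.stop(state):
            break
        if cq.trigger(state):
            results.append(cq.query(state))
    return results


GB = 1024 ** 3
low_disk = ContinualQuery(
    query=lambda s: f"free disk space low at t={s['t']}",
    trigger=lambda s: s["free_bytes"] < 1 * GB,  # content-based trigger from the text
    stop=lambda s: s["t"] >= 3,                  # hypothetical stop condition
)

snapshots = [
    {"t": 0, "free_bytes": 5 * GB},
    {"t": 1, "free_bytes": GB // 2},
    {"t": 2, "free_bytes": 2 * GB},
    {"t": 3, "free_bytes": GB // 4},
]
alerts = run_cq(low_disk, snapshots)
```

In this run only the snapshot at t=1 fires the trigger; the snapshot at t=3 is never queried because the stop condition is
checked first.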

In CloudMig, we define a policy in the format of a continual query and refer to the configuration policy as the Continual
Query Policy (CQP), denoted by CQP(policyID, appName, query, trigger, action, stopCondition). Each element of the CQP
is defined as follows:


1. policyID is the unique numeric identifier of the policy. 
2. appName is the name of the application that is installed or migrated to the host machine. 


3. query refers to the search of matching policies and the execution of policy checking. The query can be a Boolean 
expression over a simple key-value repository or SQL-like query or XQuery on a relational database of policies. 


4. trigger is the condition upon which the policy query will be executed. Triggers can be classified into time-based or 
content-based. 


5. action indicates the action to be taken upon the query results. It can be a warning flag in the configuration table or a
warning message sent by email or displayed on the command line of an operator’s terminal.


6. stopCondition is the condition upon which the CQP stops executing.


An example CQ-based policy is to check whether the replication degree is larger than the number of DataNodes in Hadoop
prior to migration, or when changing the replica factor (replication degree) or reducing the number of DataNodes. Whenever
the check returns a true value, an alert is sent to re-configure the system. Here, the query component is responsible for
checking whether the replication degree is larger than the number of DataNodes in Hadoop. The trigger condition is Hadoop
migration, a change of the replica factor (replication degree), or a reduction of the number of DataNodes. The action is
defined as re-configuration of the Hadoop system when the policy check returns true. In CloudMig, we introduce a default stop
condition of one month for all CQ-enabled configuration policies.
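
The replication-degree policy can be written down as a CQP record in a few lines. This is a sketch under stated assumptions:
the field types, the event names, the configuration keys (replication_degree, num_datanodes) and the evaluate helper are all
hypothetical and only mirror the six CQP elements listed above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CQP:
    policy_id: int
    app_name: str
    query: Callable[[dict], bool]          # True means the constraint is violated
    trigger: Callable[[str], bool]         # which events cause the check to run
    action: Callable[[str], None]          # what to do when a violation is found
    stop_condition: Callable[[int], bool]  # days since installation -> stop?


alerts = []  # stands in for an email or terminal warning

replication_policy = CQP(
    policy_id=1,
    app_name="hadoop",
    query=lambda cfg: cfg["replication_degree"] > cfg["num_datanodes"],
    trigger=lambda event: event in {"migration", "replication_changed", "datanode_removed"},
    action=alerts.append,
    stop_condition=lambda days: days >= 30,  # default stop condition of one month
)


def evaluate(policy, event, config, days_installed):
    """Run the policy once; return True if a violation was reported."""
    if policy.stop_condition(days_installed) or not policy.trigger(event):
        return False
    if policy.query(config):
        policy.action(
            f"policy {policy.policy_id} ({policy.app_name}): "
            "replication degree exceeds number of DataNodes; re-configure"
        )
        return True
    return False
```

For example, evaluate(replication_policy, "migration", {"replication_degree": 3, "num_datanodes": 2}, 0) reports a violation,
while the same configuration is ignored for an unrelated event or once the one-month stop condition has been reached.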


5.3 CloudMig Server side Template Management and Policy Management 


CloudMig aims at managing the installation templates and configuration policies to simplify the migration for large scale 
enterprise systems which may be comprised of thousands of nodes with multi-tier applications. Each application has its own 
configuration policy set and installation template set and the whole system needs to manage a large collection of configuration 
policies and installation templates. The CloudMig server side configuration management system helps to manage the large 
collection of templates and configuration policies effectively by providing system administrators (operators) with convenient 
tools to operate on the templates and policies. Typical operations include policy or template search, indexing, application 
specific packaging and shipping, to name a few. Detaching the template and policy management from individual application 
and utilizing a centralized server also improves the reliability of CloudMig in the presence of individual node failures. 

In CloudMig, the configuration management server operates at the unit of a single application. Each application corresponds
to an installation template set and a configuration validation policy set. The central management server is responsible
for managing the large collection of configurations on a per application basis and providing migration planning to speed up
the migration process and increase the assurance of application migration. Concretely, the configuration management server
mainly coordinates the tasks of the installation template management engine and the configuration policy management engine.

Installation Template Management Engine.


As shown in Figure 7, the installation template management engine is the system component responsible for
creating templates, updating templates, and advising on templates for installation. It consists of a central template
repository and a template advisor. The template repository stores and maintains the template collections of all the
applications. The template advisor provides the operators with template manipulation capabilities such as creating, updating,
deleting, searching and indexing templates over the template repository. On a per application basis, operators may create an
application template set, add new templates to the set, update templates from this set or delete templates. The template
advisor searches and dispatches templates for new installations and propagates template updates to the corresponding
application hosting nodes. For example, during the process of RUBiS installation, for a specific node, the template advisor
dispatches the appropriate template depending on the server type (web server, application server or database server) and
transmits (ships) the new installation set to the particular node.

Table 1. Migration Error Detection Rate

Migration Type        Error Detection Rate
Hadoop migration      83%
RUBiS migration       55%

Concretely, for each application, the central installation template management engine builds the context library, which
stores all the source information as key-value pairs; selects a collection of standard facility checklist templates that
apply to the particular application; picks a set of local resource checklist templates as the checklist collection for the
application; and finally builds the specific application installation template. The central management engine then bundles the
collections of templates and policies for the particular application and transmits the bundle to the client installation
template manager to start the installation instance.
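
The bundling step can be pictured as assembling one record per application and server type. The sketch below is illustrative
only: the Bundle fields, the repository dictionaries and all file and checklist names are hypothetical stand-ins for the central
template and policy repositories.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Bundle:
    app_id: str
    server_type: str
    context: Dict[str, str]
    resource_checklists: List[str]
    facility_checklists: List[str]
    install_template: str
    policies: List[str] = field(default_factory=list)


# Hypothetical repository contents, keyed by (application, server type) or server type.
INSTALL_TEMPLATES = {
    ("rubis", "web"): "install_apache.sh.tmpl",
    ("rubis", "app"): "install_tomcat.sh.tmpl",
    ("rubis", "db"): "install_mysql.sh.tmpl",
}
FACILITY_CHECKLISTS = {
    "web": ["check_openssh"],
    "app": ["check_java", "check_openssh"],
    "db": ["check_openssh"],
}
RESOURCE_CHECKLISTS = ["check_disk_space", "check_install_path_permission"]


def build_bundle(app_id, server_type, context, policies):
    """Assemble the templates and policies shipped to one node of the application."""
    key = (app_id, server_type)
    if key not in INSTALL_TEMPLATES:
        raise KeyError(f"no installation template for {key}")
    return Bundle(
        app_id=app_id,
        server_type=server_type,
        context=dict(context),
        resource_checklists=list(RESOURCE_CHECKLISTS),
        facility_checklists=list(FACILITY_CHECKLISTS[server_type]),
        install_template=INSTALL_TEMPLATES[key],
        policies=list(policies),
    )
```

Copying the lists and the context into each bundle keeps later edits on one node's bundle from leaking back into the shared
repository data, which is a choice of this sketch rather than a documented property of CloudMig.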


Configuration Policy Management Engine. 

As the central management unit for the policies, the policy engine consists of four components: policy repository,
configuration repository, policy advisor, and action manager. Together they cooperate to create, maintain,
dispatch and monitor policies and to execute the corresponding actions based on the policy execution results. The
components of the policy engine are described below:


1. The policy repository is the central store where all the policies for all the applications are maintained. It is also 
organized on a per application basis. Each application corresponds to a specific policy set. This policy set is open to 
addition, update, or delete operations. Each policy corresponds to a constraint set on the application. 


2. The policy advisor works on the policies in the policy repository directly and provides the functionality for application
operators to express constraints in the form of CQ-based policies. Application operators create policies through this
interface.


3. The configuration repository stores all the configuration files on a per application basis. It ships the configurations from
the CloudMig configuration server to the local configuration repository on the individual node (client) of the system.


4. The action manager handles the validation results from the policy validator running on client and triggers the corre- 
sponding action based on certain policy query result, in the form of an alert through message posting or email or other 
notification methods. 


5.4 CloudMig Configuration Management Client 


The CloudMig configuration management client runs at each node of a distributed or multi-tier system and is
responsible for managing the configuration policies related to the node locally. Corresponding to the CloudMig configuration
management engine at the server side, the CloudMig client works as a thin local manager for the templates and policies which
apply only to a particular node. A client engine mainly consists of two components: the client template manager and the
client policy manager.

Client Template Manager. 

The client template manager manages the templates for all the applications installed in the host node on a per application
basis. It consists of three components: template receiver, template operator and local template repository. The template
receiver receives the templates from the remote CloudMig configuration management server and delivers the templates to the
local template manager. The local template manager installs the application based on the template with the necessary
substitution operations. The local template manager is also responsible for managing the local template repository, which
stores all the templates for the applications that reside at this node.

The concrete process of template based installation works as follows. After the client template manager receives
the collection of installation templates from the server side installation template management engine, it first runs
the local resource checklist templates to detect whether any prerequisite checklist items are not met. For
example, it checks if the available disk space is less than the amount needed to install the application, or if the user has
the access permissions to the installation path, etc. Next, the standard facility checklist template runs to detect
whether all the standard facilities are installed. Finally, the dynamic information in the application specific templates
is substituted and the context dictionary is integrated to run the normal installation process.

Figure 8. Hadoop error detection
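
The ordering of these steps can be sketched as a small staged runner: local resource checks first, standard facility checks
next, and the installation itself only if both stages report no problems. The function and type names are hypothetical, and
aborting at the first failing stage is an assumption of this sketch; the text only states the order of the stages.

```python
from typing import Callable, List, Tuple

# A check returns a list of problem descriptions; an empty list means it passed.
Check = Callable[[], List[str]]


def install_from_templates(
    resource_checks: List[Check],
    facility_checks: List[Check],
    install: Callable[[], None],
) -> Tuple[bool, List[str]]:
    """Run the checklist stages in order and install only if all of them pass."""
    for stage in (resource_checks, facility_checks):
        problems = [p for check in stage for p in check()]
        if problems:
            # Assumption: stop at the first failing stage and report its problems.
            return False, problems
    install()
    return True, []
```

A caller would pass in checks produced from the local resource and standard facility checklist templates, and an install
callable that runs the substituted installation script.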


Client Policy Manager. 


There is a client policy manager residing together with the host node to manage the policies for the local node. It
mainly consists of a policy receiver, a policy validator, a local policy repository and a local config repository. The policy
receiver receives the policies transmitted from the policy advisor in the central policy server, and stores the policies
in the local policy repository. The local config repository receives the configuration data directly from the central config
repository. The local policy validator runs each policy. It retrieves the policy from the local policy repository and searches
the related configuration data to run the policy upon the configuration data. The policy validator transmits the validation
results to the action manager in the central server to take the alert actions.

Figure 9. RUBiS error detection


6 Case Studies with CloudMig 


We run CloudMig with the same set of operators on the same set of case studies after the manual migration process is done
(recall Section 4). We count the number of errors that are detected by the CloudMig configuration management and installation
automation system. We show through a set of experimental results below that CloudMig overall achieves a high error detection
rate.

Figure 8 shows the error detection results for the Hadoop migration case study. As one can
dency preservation errors, network connectivity errors, shutdown/restart errors, and all the access control errors. This
confirms the effectiveness of the proposed system, in that it can detect the majority of the configuration errors. The two
types of error that cannot be fully detected are platform difference errors and software/hardware compatibility errors.
Platform difference errors are hard to detect because they require the operators to fully understand the uniqueness of the
particular Cloud platform first. As long as the operator understands the platform sufficiently, for example, through lessons
learned from others or policies shared by others, we believe that such errors can be reduced significantly as well. The
current implementation of CloudMig cannot detect software/hardware compatibility errors mainly because the default
configuration data lacks application-specific software/hardware compatibility information. Although in the first phase of
implementation we mainly focus on the configuration checking triggered by the original configuration data, we believe that as
operators weave more compatibility policies into the CloudMig policy engine, this type of error can also be reduced
significantly. As Table 1 shows, CloudMig detected 83% of the errors in Hadoop migration overall.

Figure 10. Overall migration error detection (plot title: Migration Error Detection; x-axis: Error category, covering
dependency preservation, network, platform difference, reliability, shutdown/restart, software and hardware, and access
control and security)

Figure 9 shows the error detection result for the RUBiS migration case study. In this study, we can see that CloudMig can
detect all the dependency preservation errors and reliability errors.

However, because a multi-tiered Internet service system involves a higher number of different applications, it leads to more
complicated software/hardware compatibility issues than in the case of Hadoop migration. In the experiments reported
in this paper we focus on the configuration driven by the default configuration policies, which lack adequate
software/hardware compatibility policies for RUBiS; thus the CloudMig system did not detect the software/hardware errors. On
the other hand, this result also indicates that in the RUBiS migration process, operators should pay special
attention to the software/hardware compatibility issues, because such errors are difficult to detect with automated tools. It
is interesting to note that CloudMig was able to detect only half of the access control errors in RUBiS. This is because
these errors include MySQL privilege grant operations, which are embedded in the application itself, and the CloudMig
configuration validation tool cannot intervene in the internal operations of MySQL. Overall, CloudMig detected 55% of
the errors in RUBiS migration, as shown in Table 1.
Figure 11. Overall migration error detection ratio


Figure 10 and Figure 11 show the number of detected errors and the error detection ratio of each error type, respectively,
summarized across the Hadoop migration case study and the RUBiS migration case study. Overall, CloudMig can detect
all the dependency preservation and reliability errors, 80% of the network errors and 60% of the access control and
security errors. In total, these four types of errors accounted for 74% of the total error occurrences. For shutdown/restart
errors, CloudMig detected 50% of such errors, and it did not detect the software/hardware compatibility errors. This is
because the application configuration data usually contains less information related to shutdown/restart operations or
software/hardware compatibility constraints, and this fact makes the configuration checking on these types of errors difficult
without adding additional configuration policies. Figure 12 shows the percentage of error types in the total number of
detected errors. One can see that 51% of the detected errors are dependency preservation errors, and 17% of the detected
errors are network errors. Table 1 shows that across all the migrations, the error detection rate of the CloudMig system is
70%.

Figure 12. Overall migration error detection percentage (plot title: Migration Error Detection Distribution). The legend
lists the error types in the decreasing percentage order.

Overall, these experimental results show the efficacy of CloudMig in reducing the migration configuration errors, simplifying
the migration process and increasing the level of assurance of migration correctness.


7 Conclusion 


We have discussed the system migration challenge faced by enterprises in migrating local data center applications to the
Cloud platform. We analyzed why such migration is a complicated and error-prone process and pointed out the limitations of
the existing approaches to this problem. We then introduced the operator-based experimental study conducted over
two representative systems (Hadoop and RUBiS) to investigate the error sources. From these experiments, we built the error
classification model and analyzed the demands for a semi-automated configuration management and migration validation
system. Based on the operator study, we designed the CloudMig system with two unique characteristics. First, we developed
a continual query based configuration policy checking system, which helps operators weave important configuration
constraints into continual query policies and periodically run these policies to monitor configuration changes and to detect
and report possible violations of configuration constraints. Second, CloudMig combines the continual query based policy
checking system with the template based installation automation system, offering effective ways to help operators reduce
installation errors and increase the correctness assurance of application migration. Our experiments show that CloudMig can
effectively detect a majority of the configuration errors in the migration process.


Abstract


System administrators use a variety of techniques to track down and repair (or avoid) problems that occur in the systems
under their purview. Analyzing log files, cross-correlating events on different machines, establishing liveness and
performance monitors, and automating configuration procedures are just a few of the approaches used to stave off entropy.
These efforts are often stymied by the presence of hidden dependencies between components in a system (e.g., processes,
pipes, files, etc). In this paper we argue that system-level provenance (metadata that records the history of files, pipes,
processes and other system-level objects) can help expose these dependencies, giving system administrators a more complete
picture of component interactions, thus easing the task of troubleshooting.


KEYWORDS: troubleshooting; diagnosis; depen- 
dencies; provenance; mental models. 


1 Introduction 


Most highly experienced system administrators can re- 
member a time in their career when they were virtu- 
ally clueless about the configuration of their systems. 
Whether learning on the job as a junior sysadmin or 
walking into a brand new infrastructure, nobody is ever 
handed a comprehensive guide to “the way things work 
around here.” Instead, sysadmins must slowly develop a 
mental model of the systems in their care [6, 15]. They 
study existing documentation and Internet sources, so- 
licit expert advice, explore component interactions, and 
much more. While this process is valuable in the long 
run, it is also time-consuming and error prone, and com- 
petes with the efficiency of whatever task is at hand (e.g., 
tracking down and fixing the root causes of problems). 


Additionally, mental models are developed on an as- 
needed basis and fail to account for hidden dependencies 
between system components, resulting in large gaps and 
inaccuracies. 

This paper explores how system-level provenance can 
effectively expose hidden dependencies, improve men- 
tal models, and help improve the troubleshooting process 
for system administrators. Our goal is to build a prove- 
nance analysis engine that can automatically construct an 
accurate, queryable map of component interactions for 
single systems, networked sites, and beyond. Imagine ar- 
riving at your desk on a Monday morning and being able 
to explore what your site looks like based on provenance 
collected over the weekend. 


2 Dependencies 


Efficient troubleshooting requires mental models that 
are sufficiently accurate and complete to suggest proper 
courses of action. One part of a good mental model is 
a map of dependencies between the various components 
in a system. At a high level, components can be thought 
of as subsystems (e.g., the web subsystem depends upon 
the filesystem). At the lowest level of abstraction, com- 
ponents consist of programs and their individual config- 
uration parameters. At this level, a good mental model 
maps how parameter changes affect a program’s depen- 
dencies. 

For the purposes of this research, we loosely define 
dependency as the relationship created when information 
flows from one component to another in order for the re- 
cipient of that information to function correctly. For ex- 
ample, when a process loads a library, functions neces- 
sary to the core behavior of the process are transmitted to 
it from a file. The process is dependent upon the library 
being loaded into some part of memory and being made 
accessible. Likewise, when Apache starts, it reads neces- 
sary parameters from an external source of information 
(e.g., httpd.conf). Furthermore, Apache depends upon 
its runtime environment to properly specify the location 
of httpd.conf. 

These are obvious examples of dependencies, but note 
that the way in which we have defined dependency re- 
quires a clear understanding of what it means for a com- 
ponent to function correctly. Formally, functional cor- 
rectness is determined by behavior: every input produces 
correct output, where the output also comprises error 
conditions. Thus, if a process outputs “file not found” 
for some input, it may still be functioning correctly. But 
this definition is too strict for our purposes. 

System administrators have a general sense of how 
components are supposed to behave, and they can usually 
determine when something is awry. For example, mis- 
configuration of one or more components is a frequent 
cause of “abnormal” behavior. Formally, a DBMS that is 
configured with a parameter that directs it to the wrong 
dataset will produce the correct behavior for how it is 
configured, i.e., it will still answer queries as directed, 
etc. But the admin will see unexpected outputs because 
the inputs were different than expected. This leads us 
to an imprecise definition of “functioning correctly” as 
“exhibiting expected behavior”. 


3 The PASS Project 


Digital provenance is metadata that describes the ances- 
try or history of a digital object. In non-digital domains, 
such as art curation, provenance is often collected manually. 
But in the digital domain, we have the capability to record 
provenance automatically. The provenance-aware storage 
system (PASS) project [26] currently collects system-level 
provenance from inside a running kernel and builds a 
directed acyclic graph that describes ancestral relationships 
between files, pipes, and processes.¹ 

The provenance graph would be virtually useless with- 
out a way of extracting pertinent information. We have 
developed a query language for graph-structured data 
called PQL [14, 13], which is capable of expressing com- 
plex queries with transitive closures. PQL operates on a 
semi-structured data model that allows us to ask ques- 
tions about ancestors and descendants as well as about 
paths and subgraphs. 

Consider the case in which we want to find all outputs 
of the sendmail daemon. The following SPARQL² 
query produces the desired result: 


    SELECT ?output WHERE { 
        progfile "/usr/sbin/sendmail" ?process 
        ?output output-of ?process 
    } 


¹This includes variables and other information about the environment 
in which they execute. 

²PQL is similar to SPARQL [32], an SQL-like query language for 
RDF. 

Figure 1: A diagram of the PASS Architecture. 


In this example, ?output and ?process are variables. 
For every process that is an instantiation of sendmail, 
the query will return the process’s output objects (e.g., 
files, pipes, processes, etc.) in the variable ?output. With 
PQL or a similar graph query language, we can issue 
simple queries such as the one in our example or complex 
queries such as “find all objects that result from the 
same (or similar) sequence of events”, which is a 
path-finding query. 
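
To make the shape of the sendmail query concrete, the following is a minimal Python sketch over a toy set of triples. It is not PQL and not the PASS storage format; the triple layout, the object names, and the function outputs_of_program are assumptions made purely for illustration.

```python
# Toy provenance store: (subject, predicate, object) triples.
# Every identifier and path here is hypothetical, not real PASS data.
TRIPLES = [
    ("proc:101", "progfile", "/usr/sbin/sendmail"),
    ("proc:102", "progfile", "/usr/sbin/sendmail"),
    ("proc:200", "progfile", "/usr/sbin/cron"),
    ("file:/var/spool/mqueue/q1", "output-of", "proc:101"),
    ("pipe:7", "output-of", "proc:102"),
    ("file:/var/log/cron", "output-of", "proc:200"),
]


def outputs_of_program(triples, program_path):
    """Return every object that is an output of any process running program_path."""
    # Step 1: bind ?process to each process instantiated from the program file.
    processes = {s for (s, p, o) in triples if p == "progfile" and o == program_path}
    # Step 2: bind ?output to each object recorded as an output of such a process.
    return sorted(s for (s, p, o) in triples if p == "output-of" and o in processes)


print(outputs_of_program(TRIPLES, "/usr/sbin/sendmail"))
# prints ['file:/var/spool/mqueue/q1', 'pipe:7']
```

The two set comprehensions play the role of the two patterns in the query: the first selects the processes, the second selects their outputs.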

If we think of files, pipes, and processes as system 
components between which information flows, then the 
provenance graph can be viewed as a graph of potential 
dependencies. Nodes of the graph represent components 
and edges represent a “may depend upon” relationship 
from one component to another. In practical terms, for a 
process P that reads from a file F, there exists a directed 
edge from the descendant P to the ancestor F. Likewise, 
if the same process writes to a pipe J, an edge from J to 
P will be generated in the graph. The graph describes 
only potential dependencies, because in the absence of 
code and dataflow analysis, we cannot be certain that 
any descendant depends upon its ancestors to function 
correctly. 
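
The edge directions described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the PASS implementation: events are reduced to (process, operation, object) tuples, and the names build_graph and all_ancestors are hypothetical.

```python
from collections import defaultdict


def build_graph(events):
    """Map each descendant to the set of its direct ancestors.

    events: iterable of (process, op, obj) with op in {"read", "write"}.
    """
    ancestors = defaultdict(set)
    for process, op, obj in events:
        if op == "read":
            ancestors[process].add(obj)   # P read F: edge from P to F
        elif op == "write":
            ancestors[obj].add(process)   # P wrote J: edge from J to P
        else:
            raise ValueError(f"unknown operation: {op}")
    return ancestors


def all_ancestors(graph, node):
    """Transitive closure: every object that node may depend upon."""
    seen, stack = set(), [node]
    while stack:
        for parent in graph.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


events = [("P", "read", "F"), ("P", "write", "J"), ("Q", "read", "J")]
graph = build_graph(events)
print(sorted(all_ancestors(graph, "Q")))
# prints ['F', 'J', 'P']
```

Here a second process Q that reads the pipe J inherits a potential dependency on both P and F, which is exactly the kind of transitive relationship the provenance graph is meant to expose.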


3.1 PASS Architecture 


Figure 1 shows the PASS architecture.³ The interceptor 
is a set of system call hooks that extract arguments 
and other necessary information from kernel data structures, 
passing them to the observer. Currently, PASS 
intercepts execve, fork, exit, read, readv, write, 
writev, mmap, open, pipe, and the kernel operation 
drop_inode. These calls are sufficient to capture the 
rich ancestry relationships between Linux files, pipes, 
and processes. In addition, applications can be compiled 
to use libpass, which allows us to send application-specific 
provenance directly to PASS. 

³The modified image and description of architecture are used with 
permission from the authors [26]. 

This raw “proto-provenance” goes to the observer, 
which translates proto-provenance into provenance 
records. For example, when a process P reads a file A, 
the observer generates a record that includes the fact that 
P potentially depends upon A (i.e., a cross-reference to 
A). The first time an object is created, the observer as- 
signs to it a unique pnode identifier. A pnode number is 
similar to an inode number except that it is never recy- 
cled, even after an object is destroyed. This allows us to 
maintain provenance for every object of interest that ever 
existed. Suppose that files A and B and process P have 
all been assigned pnodes. When P exits, its pnode must 
be maintained so that the transitive potential dependency 
of B upon A can be queried. The same logic holds for the 
case in which A is deleted. 

The analyzer then processes the stream of provenance 
records to eliminate duplicates and cyclic dependencies. 
Duplicates occur when a provenance object is used as 
input multiple times in the same “session” by another 
object. For example, after the initial read of pipe J by 
process P, every further read creates a duplicate record 
until the pipe is closed. Yet a record of the initial read is 
all we require to posit a potential dependency.⁴ In similar 
fashion, the provenance of a file F to which P has written 
multiple times will only contain a single record of the 
initial write. 
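
As a rough illustration of duplicate elimination, the sketch below keeps only the first record for each (descendant, ancestor) pair. It assumes records arrive as plain tuples and deliberately ignores session boundaries such as a pipe being closed and reopened; the function name deduplicate is hypothetical.

```python
def deduplicate(records):
    """Keep only the first record for each (descendant, ancestor) pair.

    records: iterable of (descendant, ancestor) tuples in arrival order.
    Simplification: session boundaries (e.g., a pipe being closed) are ignored.
    """
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


# P reads pipe J three times, then writes file F twice.
stream = [("P", "J"), ("P", "J"), ("P", "J"), ("F", "P"), ("F", "P")]
print(deduplicate(stream))
# prints [('P', 'J'), ('F', 'P')]
```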

Unless time travel is possible⁵, it is impossible for a 
descendant object to affect its ancestor. This is why cycles 
in the provenance graph must be broken or avoided 
by the analyzer. PASS avoids cycles by versioning. At 
the time of its creation, each provenance object is assigned 
a version number of 0. New versions of an object 
are assigned monotonically increasing numbers. If 
process P reads from file F, and later writes to that same 
file, the analyzer will avoid a cycle by versioning the 
file’s provenance. If P reads the file again, the new record 
for this event will contain a cross-reference to F_v1. That 
is to say that once F is written, further provenance will 
be collected only for subsequent new versions, and the 
provenance of F_v0 will contain only whatever may have 
happened to F prior to the write and that didn’t involve 
a cycle. The versioning algorithm works well on cycles 
of any length, involving any type of object at any version 
level. 
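
The read-write-read scenario can be traced with the following Python sketch. It uses a deliberately simple rule that is our assumption, not necessarily the rule PASS applies: an object receives a new version whenever it would gain a new ancestor after something else already depends upon its current version. The class VersionedGraph and its methods are hypothetical names.

```python
class VersionedGraph:
    """Toy cycle avoidance by versioning (a sketch, not the PASS analyzer)."""

    def __init__(self):
        self.version = {}    # object name -> current version number
        self.frozen = set()  # (name, version) nodes that something depends upon
        self.edges = set()   # ((descendant, ver), (ancestor, ver)) pairs

    def _current(self, name):
        return (name, self.version.setdefault(name, 0))

    def add_dependency(self, descendant, ancestor):
        """Record that descendant may depend upon ancestor."""
        anc = self._current(ancestor)
        desc = self._current(descendant)
        if (desc, anc) in self.edges:
            return  # duplicate record, nothing new to learn
        if desc in self.frozen:
            # Another node already points at this version. Giving it a new
            # ancestor could close a cycle, so start a new version instead.
            self.version[descendant] += 1
            new = self._current(descendant)
            self.edges.add((new, desc))  # new version descends from the old
            desc = new
        self.edges.add((desc, anc))
        self.frozen.add(anc)

    def read(self, process, obj):
        self.add_dependency(process, obj)

    def write(self, process, obj):
        self.add_dependency(obj, process)


g = VersionedGraph()
g.read("P", "F")    # P_v0 depends upon F_v0
g.write("P", "F")   # F_v0 is frozen, so F_v1 is created and depends upon P_v0
g.read("P", "F")    # P_v0 is frozen, so P_v1 is created and depends upon F_v1
print(g.version)
# prints {'F': 1, 'P': 1}
```

Because a node only gains outgoing edges while nothing points at it, no edge added under this rule can complete a cycle. The third call reproduces the behavior described above: the new record for the second read cross-references F_v1 rather than F_v0.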

The PASS system is not limited to collecting provenance 
from local storage. We have implemented extensions 
that enable provenance collection from NFS shares 
and Amazon’s S3 service [3]. This capability is especially 
important to the multitude of organizations that 
have shifted their infrastructure into the cloud [27, 28]. 

⁴In general, this level of granularity imposes limitations on our ability 
to classify dependencies, i.e., we could keep the duplicates with a 
timestamp for more accurate resolution. 

⁵There is now strong evidence to suggest that it is not [40]! 

Note that the interceptor is platform-specific by necessity, 
but that the observer and analyzer can be separated 
entirely from the operating system. The remaining com- 
ponents of the PASS architecture are not germane to the 
goals of this research. For a more complete description, 
we direct the reader to several prior works [25, 26]. 


4 Troubleshooting 


4.1 Related Work 


In the past decade, there has been exciting research 
on improving failure diagnosis for system administra- 
tors. Some approaches use visualization to help opera- 
tors rapidly detect and diagnose problems [36]. Others 
use event correlation in log-file analysis to identify ex- 
tant and potential problems [1, 12, 17, 20, 34]. Wang et 
al. [37, 38] use comparisons of current system configu- 
rations against golden state configurations that have been 
generated via statistical analysis of machine populations. 
The HPC community has made significant strides in 
tracking down and diagnosing the root causes of failures 
in grids and clusters [2, 9, 31, 39]. Most of these ap- 
proaches rely upon log analysis and can be extremely 
effective, especially in prescribed domains. However, 
log analysis may suffer from several drawbacks, includ- 
ing a lack of operational context (expected behavior); a 
“butterfly” effect on log messages that stem from small 
changes; corrupted messages; inconsistent log formats; 
and asymmetric log reports [29]. 

In the absence of formal documentation, sysadmins 
have few resources for determining the dependencies 
of a program. There exist tools that support static ex- 
traction of dependencies via analysis of package man- 
agement repositories [18] and program images [35], but 
these have quite limited capabilities. For example, the 
former tool relies upon the correctness of package pre- 
requisite information, and the latter tool only exposes 
compile-time dependencies. 

Some tools [7, 33] are able to automatically construct 
operational dependency models by actively perturbing 
or probing live systems. Active perturbation involves 
performing multiple transactions or injecting “problems” 
outside of normal operation and tracking the affected 
components by observing likely execution paths. These 
methods are invasive, with the potential to cause un- 
wanted load or unforeseen failures, and thus may be un- 
tenable in a production environment. 

There are also many other approaches for exposing 
complex dependencies and causal relationships in distributed 
systems [4, 8, 11, 30], but their ability to document, 
present, and query the models they build is limited. 
This makes them ill-suited for improving mental models 
and for generalized system and site-wide troubleshooting. 

Two research projects reflect well the philosophy we 
wish to propagate. PDA is a tool for automated prob- 
lem determination developed at IBM [16]. The tool starts 
with high-level health indicators that trigger custom-built 
probes when something is awry. The probes are built 
manually via analysis of trouble-ticket corpora. Their 
use-case scenarios reveal that a large number of prob- 
lems fall into several categories to which standard trou- 
bleshooting procedures can be applied and perhaps even 
automated. We are optimistic that these categories will 
also manifest in our provenance graphs. 

We recently discovered a tool that is similar—both in 
concept and implementation—to the framework we pro- 
pose in this paper, but more narrow in scope and no 
longer actively developed. BackTracker [19] is designed 
to analyze system intrusions by tracing chains of events 
from a detection point (e.g., a suspicious process) back 
through a dependency graph to likely points of entry. The 
goal is to document the attack vectors that expose un- 
known vulnerabilities. Similar to our approach, the graph 
is constructed by intercepting and recording the informa- 
tion in system calls. The authors also provide several 
security-specific methods by which to prioritize and fil- 
ter large portions of the dependency graph to help the 
user along. The requirements for system troubleshoot- 
ing are more general, thus our work may be viewed as 
an attempt to address a superset of the issues tackled by 
BackTracker. 

Although one may assume that documentation is avail- 
able for general-use tools, many organizations develop 
in-house solutions. When these solutions are intended 
for internal use only, there is little economic incentive 
to create polished user interfaces or comprehensive doc- 
umentation; tools must simply be “good enough.” As 
the number of internal libraries, scripts, and programs 
increases, making changes to the system becomes in- 
creasingly difficult. For example, deleting old libraries 
becomes virtually impossible when sysadmins have lit- 
tle knowledge of what programs utilize which libraries. 
The complexity of these poorly understood systems will 
continue to grow without bound as long as they are ac- 
tively developed. Sysadmins in this situation would ben- 
efit greatly from a comprehensive and explorable graph 
of component dependencies. 


4.2 A ‘Simple’ Example 


As suggested earlier, a clear and accurate system model 
is paramount to troubleshooting. Although sysadmins already 
troubleshoot in the absence of such models, their 
efforts have been significantly hindered by complexity. 
When something fails in a system, knowing where to 
look first is usually a “gimme”. Under progressively 
greater pressure, knowing where to look second, third, 
fourth, and so on, requires experience and perseverance. 

For example, in most UNIX distributions, the re- 
solver, which sends DNS queries to translate names 
into IP addresses, loads its configuration from the file 
/etc/resolv.conf. Traditionally, this file was edited 
manually. In modern distributions such as Ubuntu, the 
file is now automatically generated and modified by the 
NetworkManager daemon. Various options for the net- 
work manager can be configured via GUI or the com- 
mand line, but not resolver-specific options. Instead, 
if the host obtains its network configuration via DHCP, 
changes to resolv.conf are governed by the network 
manager’s communication with the dhclient daemon, 
using D-Bus IPC.⁶ The behavior of dhclient is in turn 
configured via the file /etc/dhcp3/dhclient.conf. 

Given the dependencies just described, where does the 
system administrator look when she determines there is 
a problem with name resolution? The first place she 
may look is resolv.conf. Luckily for her, there is 
a comment in the file that states it has been automati- 
cally generated by the network manager. However, this is 
where the trail goes lukewarm. The manual page for the 
network manager says nothing about the resolver. Per- 
haps the sysadmin recalls that name resolution failures 
can be symptomatic of DHCP misconfiguration, lead- 
ing her to check the dhclient manpage and subsequently 
dhclient.conf. She may find some useful information 
there, but she is hard pressed to discover that the net- 
work manager is modifying the resolver’s configuration 
by talking to the DHCP client. Also, dhclient.conf 
may have been configured by an automated script. The 
trail goes cold until Google is consulted and a solution is 
discovered. But this is unsustainable as a standard proce- 
dure for troubleshooting; eventually, even Google is out 
of answers. 

Using a provenance graph (Figure 2) and the right 
query types (or tools we build specifically for this pur- 
pose), our fearless administrator would more quickly dis- 
cover the dependencies in our example. Let us walk 
through the troubleshooting session once more with the 
help of provenance. The graph has been trimmed and 
condensed for clarity, so the steps taken in an actual ses- 
sion may be more involved. Also, the following analysis 
suggests that we are able to collect provenance for re- 
mote sockets. This is not currently the case for PASS, 
but we are working on such a mechanism. 


⁶D-Bus implements inter-process communication (IPC) via 
Unix sockets, with each endpoint represented as an inode object and 
two file objects in the kernel. 

Figure 2: A partial provenance graph representing potential 
dependencies between components involved in Linux name 
resolution. 


We may safely start at the network manager node 
(hereafter referred to as netman), since we already know 
the source of the generated resolv.conf. The ancestors 
of netman include a socket endpoint (a “special” file) and 
various other inputs, one of which will be a configuration 
file. We can probably safely exclude the configuration 
file, because there is nothing in netman’s documentation 
about resolver options. But why has netman received in- 
formation from a socket via the D-Bus? It has obviously 
communicated with another process. Here is where we 
run into a slight snag: D-Bus often has a plethora of 
socket endpoints as inputs (in addition to other inputs), 
so how can we determine the right ancestor? In many 
cases we may not be able to directly identify the most 
important ancestor but we can probably narrow down our 
choices. 

One possibility involves checking timestamps of the 
provenance edges between objects of interest. In this 
case we could compare the timestamp of outputs to net- 
man’s socket endpoint with the timestamps of D-Bus in- 
puts from any of its ancestors. We would discard D-Bus 
inputs that occurred after outputs to the socket as well as 
inputs that occur too long before outputs. Other techni- 
cal solutions are also possible, including the recording of 
socket descriptors in provenance objects. 
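
The timestamp comparison might look like the following Python sketch. It assumes each provenance edge carries a numeric timestamp, which PASS does not necessarily record in this form; the function candidate_inputs, the max_lag parameter, and the client names in the example are all hypothetical.

```python
def candidate_inputs(dbus_inputs, output_time, max_lag):
    """Return D-Bus inputs that could plausibly have led to a given output.

    dbus_inputs: list of (source_name, timestamp) pairs.
    output_time: timestamp of the write to the socket endpoint of interest.
    max_lag: largest input-to-output delay (same units) still considered related.
    """
    return [
        (source, ts)
        for source, ts in dbus_inputs
        # Discard inputs that arrive after the output, or too long before it.
        if output_time - max_lag <= ts <= output_time
    ]


inputs = [("dhclient", 99.8), ("cupsd", 42.0), ("pulseaudio", 100.5)]
print(candidate_inputs(inputs, output_time=100.0, max_lag=1.0))
# prints [('dhclient', 99.8)]
```

Choosing max_lag is itself a judgment call: too small a window may drop the real ancestor, while too large a window leaves most of the candidates in place.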

Once we have reasonably narrowed our choices, we 
will have to rely on experience to take us the rest of the 
way. Knowing that our machine receives network con- 
figuration parameters via DHCP will allow us to discard 
many other D-Bus ancestors, such as the audio, printing, 
and display subsystems. Once we reach the dhclient an- 
cestor, we can determine which of its configuration op- 
tions found in dhclient.conf are likely to be involved 
in name resolution. 

The D-Bus example represents one of the worst-case 
scenarios in tracing root causes. The problem is twofold: 
at any given time, the number of ancestors and descen- 
dants of the daemon is usually very large, which re- 
sults in an overwhelming path explosion; but the larger 
problem is that valuable provenance is hidden inside the 
D-Bus black box. For instance, if we had access to 
the internal dbus object name that identifies the connec- 
tion between two clients, we could easily narrow our 
search to the real ancestors of the network manager. One 
way in which to accomplish this would be to create a 
provenance-aware version of D-Bus using the libpass 
library. This may be feasible for a small portion of par- 
ticularly “opaque” system programs with many distinct 
inputs and outputs. 


5 Ranking Dependencies 


While the provenance of a process’s outputs depends 
upon the process’s inputs, the process itself is not neces- 
sarily dependent upon every input to function correctly. 
For example, the program cat, which reads the contents 
of an input stream, only depends upon three shared li- 
braries to function correctly, yet a provenance graph in- 
cludes edges to every distinct input object that cat opens. 
Though the absence of these inputs may cause a script to 
fail, none of them is essential to the core behavior of cat. 
This is why we have described the provenance graph as 
a graph of potential dependencies only. 

A similar fact holds for many programs; almost every 
file (or other input) that is necessary for them to function 
properly is loaded with their image or shortly thereafter. 
There are notable exceptions: programs such as Apache 
and Perl frequently load modules on demand; daemons 
may reload their configuration files when a HUP 
signal is received, but will rarely reload a library; and 
shell scripts frequently defy all notions of predictability. 

It would appear that the generated graph contains too 
much information for our purposes. Too many “unim- 
portant” edges will make troubleshooting more difficult. 
Thus we need a way to limit the scope of our queries to 
those ancestral objects that are most likely to have con- 
tributed to the behavior or contents of a target descen- 
dant. 


5.1 Statistical Approaches 


There is a statistical approach that will help us rank the 
contribution to dependency made by individual edges, 
full paths, and ancestral subgraphs. 

Consider that any given snapshot of a provenance 
graph represents events as they actually happened. Sup- 
pose that we look at a snapshot of the provenance graph 
generated between time t1 and time t2. We see an 
edge from the process /usr/sbin/chpasswd to the file 
/etc/pam.conf. We also see several edges leading 
from other objects to chpasswd. Let us examine what 
we know. We do not track information flow, so we do not 
know what chpasswd did with information that it read 
from pam.conf. We do not know if the process or its descendants 
would have functioned correctly if pam.conf 
were missing or contained different content. The graph 
only tells us that the provenance of chpasswd and its descendants 
depended upon pam.conf in its current state. 

Let us assume that the process functioned correctly 
during the snapshot period. How do we assign a depen- 
dency rank to edges in the graph? One way might be to 
take multiple snapshots at equally spaced intervals and 
count the number of snapshots in which the edge of interest 
appears. A high count would indicate a higher likelihood 
of dependence. While this may seem reasonable, 
it will not work. 

Recall that an object is uniquely identified by a pnode 
number, which remains the same through successive ver- 
sions (and even unto death). Once a node becomes a part 
of the graph, it is never removed. Any edges connected 
to the node remain in the graph as well. Thus, there is no 
difference between snapshots except for the creation of 
nodes and edges, and increases in object versions. 

The correct approach takes advantage of the logical 
separation between provenance objects. A process is the 
running instantiation of a particular program. As such, 
two separate invocations of a program (processes) will 
be assigned distinct pnodes and appear as distinct nodes 
in the graph. Figure 3 shows an example of this scenario. 

Figure 3: Processes with distinct pnodes. The program P 
(grouped processes) depends upon A with a ranking of 1 (thick 
edge). 

Process P1 has read file A, written file B and then 
terminated. Some time later, process P2 takes the exact 
same actions. Notice that both processes have a 
provenance edge that points to the program executable 
/bin/P. For all processes that have a given executable as 
input, we can query whether or not they have a particular 
input (A in this case). If the same input object appears in 
the provenance of every process, then we declare that the 
current version of the program depends upon the input 
object with a ranking of 1.0. We denote this by group- 
ing all such processes, drawing an edge from the group 
to the file, and labeling the edge with its rank. Alterna- 
tively, we can merge process nodes into a super-node to 
keep the graph clean. 

There will be cases where only some instances of a 
program read from the same file. In these cases, only 
those instances are grouped and an edge is drawn to the 
file with a dependency rank given by 


    rank = (# of instances that read) / (total # of instances) 
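
A minimal Python sketch of this ratio follows. It assumes we have already grouped the processes of one program and collected the set of inputs each one read; the function name dependency_rank and the data layout are illustrative only.

```python
def dependency_rank(process_inputs, input_obj):
    """Fraction of a program's instances that read input_obj.

    process_inputs: dict mapping each process (one pnode per invocation of the
    same executable) to the set of input objects that process read.
    """
    if not process_inputs:
        raise ValueError("no instances of the program were observed")
    readers = sum(1 for inputs in process_inputs.values() if input_obj in inputs)
    return readers / len(process_inputs)


# Three invocations of /bin/P; two of them read file A.
runs = {
    "P1": {"/bin/P", "A"},
    "P2": {"/bin/P", "A"},
    "P3": {"/bin/P", "C"},
}
print(dependency_rank(runs, "/bin/P"))
# prints 1.0
print(dependency_rank(runs, "A"))
```

The executable itself is read by every instance and so ranks 1.0, while A ranks two thirds because only two of the three instances read it.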


We must not apply the same logic to rank the depen- 
dency of files upon programs. We do not know for cer- 
tain what happens to the information that is read from a 
file by a process, e.g., whether it changes the behavior of 
the process. By contrast, a file always depends upon the 
process(es) that created and/or wrote to it. The reason is 
that a file is a passive object whose existence and con- 
tent is governed by processes only. There is never any 
doubt that every bit of information in a file came from 
the process(es) that wrote to it.⁷ 


7Note that conceptually, if a process O removes from a file all infor- 
Figure 4: Dependency ranking for the path from wire-test to Libc-2.4.so. The graph represents a mostly real provenance trace 
but the edge ranks are for demonstration only. If the executable really compiled, it would be difficult to identify the discrepancy 
between the two versions of libc. 


Unnamed pipes are also passive objects that depend 
with certainty upon the process(es) that create them. Like 
processes, every new pipe is identified by a unique pn- 
ode number, and there always exists an edge to the pro- 
cess that created it. It might seem strange to claim that 
a pipe depends upon the processes that write to it, espe- 
cially since we think of it as a simple channel by which 
processes communicate. The information sent to a pipe 
is meant to be consumed by one or more processes dur- 
ing the pipe’s (relatively fleeting) lifetime. Unlike a file, 
none of the information in a pipe persists after it is torn 
down. Nonetheless, except for the timeframe, a pipe is 
serving one of the same purposes as a file; it is an infor- 
mation conduit between two processes. 

Since named pipes are implemented as device special 
files, they remain usable after the process that created 
them exits. Thus we can use the same grouping tech- 
nique to identify all processes that read from the same 
pipe, i.e., special file. With unnamed pipes, we need to 
go a step further. They are only connected between two 
single processes. It does not make much sense to claim 
that a program reads from the same pipe on every in- 
vocation. But we can claim that one program (i.e., every 
process from the same executable) always receives infor- 
mation from another program via a pipe, which implies 
a high-ranking transitive dependency upon the writer by 
the reader. A good way to represent this is to draw a di- 
rected edge from the group of processes to the group of 
pipes. 

Armed with this metric, we can rank the potential de- 
pendence of paths or ancestral subgraphs. A path ranking 
is the average rank of all edges in the path. Similarly, a 


mation written by another process P, we might want to say that the file 
no longer depends upon P. But we have no way of representing this at 
our current level of granularity. 


subgraph ranking is the average rank of all edges in the 
subgraph. For example, Figure 4 shows the rank of the 
highlighted path from the wire-test executable back 
to the ancestral file Libc-2.4.so. We have omitted pro- 
cess grouping for clarity. When we query for the can- 
didates that are likely causes of root problems, our tools 
should suggest exploration of the highest ranking paths 
first (accounting for rank adjustments from rules, filters, 
etc.). 
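
Path ranking and the ordering of candidates can be sketched as follows. The node names and edge ranks are invented for demonstration, and the functions path_rank and rank_paths are hypothetical; rule- or filter-based rank adjustments are not modeled.

```python
def path_rank(path, edge_rank):
    """Average rank of the edges along path (a list of nodes, descendant first)."""
    edges = list(zip(path, path[1:]))
    if not edges:
        raise ValueError("a path needs at least two nodes")
    return sum(edge_rank[edge] for edge in edges) / len(edges)


def rank_paths(paths, edge_rank):
    """Order candidate paths so that the highest-ranking ones are explored first."""
    return sorted(paths, key=lambda p: path_rank(p, edge_rank), reverse=True)


# Hypothetical edge ranks, keyed by (descendant, ancestor).
edge_rank = {
    ("target", "a"): 1.0,
    ("a", "b"): 0.5,
    ("target", "c"): 0.25,
}
paths = [["target", "c"], ["target", "a", "b"]]
print(path_rank(["target", "a", "b"], edge_rank))
# prints 0.75
print(rank_paths(paths, edge_rank)[0])
# prints ['target', 'a', 'b']
```

A subgraph ranking would follow the same pattern, averaging over every edge in the subgraph rather than over consecutive pairs along one path.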

There are three caveats regarding this approach. First, 
it has yet to be empirically tested, although our knowledge 
of operating systems provides a solid foundation. 
Second, the approach requires a bootstrap period during 
which rankings may be heavily influenced by existing 
abnormal behavior. This has the potential to mislead sys- 
tem administrators during analyses. We must therefore 
provide the ability for admins to manually adjust rank- 
ings in the graph, either permanently or via a “what-if” 
mode in a query session. The last warning is that while 
we expect the accuracy of rankings to improve over time, 
a large number of abnormal events may throw certain 
subgraph rankings into chaos at any time. We might be 
able to mitigate this by having the provenance subsys- 
tem alert us to statistical changes that exceed a certain 
threshold. 


5.2 Heuristics 


Statistical methods (and others) will carry us a fair dis- 
tance in compressing the query space. But there is no 
reason to exclude existing knowledge about dependen- 
cies or rules of thumb. We now present several observa- 
tions that will help us improve our rankings and refine 
our queries even further: 
• Our current rules assign a dependency rank to edges 
based upon how many instances of a program read 
from the same input object. Informally, this says 
that for an edge with a higher rank than another, 
there is a greater chance that the input object affects 
the program’s behavior. 


We might make the assertion that all but the sim- 
plest of daemons will always attempt to open and 
read from their associated configuration files. But 
note that if a daemon accepts a parameter that pre- 
vents loading of config files or specifies a different 
config file than the default (as many do), its input 
edges may receive a much lower rank than expected. 


Whether the daemon is dependent upon a specific
config file is usually conditioned upon its starting
parameters. Fortunately, PASS includes provenance
about the environment in which a program is
started. If we observe that a daemon always opens
the same configuration file in the presence of some
starting parameter, then we will be able to rank
dependencies for different instances of the daemon
(e.g., when '-c' is provided, the daemon always
loads config file C, but in the absence of '-c', the
daemon always loads config file A). That is to say
that we will group processes as usual, but the group
edge rank will indicate how many instances of the
program read from an input object when started
with a given parameter.


• The first-order dependencies of many programs are
known a priori, either via direct experience, documentation,
or technical detail, e.g., statically-linked
programs. We can assign a rank of 1 to the outbound
edges of these programs automatically upon
first invocation.


• Popular objects, as measured by descendant subgraph
size, are less likely to be the singular cause
of a problem. It is a reasonable assumption that if
a popular object is the cause of a problem, descendants
along more than one path would exhibit unexpected
behavior. In the case that we are only seeing
one or a few objects with unexpected behavior, we
can have our query engine dynamically reduce the
dependency ranking for paths or subgraphs that include
popular ancestors. The triggers for such a reduction
and the amount of rank reduction will need
to be determined by experiment.


Example: almost every program has libc as a core
library. This means that almost every edge that
points to the libc node will have a dependency
rank of 1. But this node is uninteresting exactly
because so many programs depend upon it. If libc is
broken or missing, we are likely to know immediately.


• Edges to files residing in well-known configuration
directories or files with well-known names can be
labeled with a high rank when all other indicators
are equal or nearly so.


For example, if a program P opens a file called
logrotate.conf in directory /etc, then we have
two more pieces of evidence to support the assertion
that P depends upon logrotate.conf. The weight
of this evidence will need to be adjusted according
to several factors, which is left for future work.


Of course, we must also provide a means by which
we can fix the dependency or non-dependency of
an object upon another object. This allows us to
correct edges in the graph for which our algorithm
has failed. There may be a semi-automated way in
which to do this, which is also left for future work.


• Edges to files residing in well-known log directories
can be labeled with a low rank.


For example, /var/log/messages is a file that is
frequently written, but certain log viewing/analysis/filtering/aggregation
tools, such as Splunk, will
frequently read the file as well.* In many cases, the
absence or corruption of a log will not affect the
proper functioning of the reading process. But it is
difficult to know how far such a failure might propagate.


* Many of these tools avoid the local filesystem altogether by logging
to a centralized host via the network. In this case, provenance
would be captured using network service extensions to PASS.


• Edges to files residing in well-known temporary directories
can be labeled with a low rank.


By definition, programs should not rely upon any
data stored in a temporary directory (e.g., /tmp).
However, programs do sometimes use such directories
to create temporary pipes or to communicate
information to themselves in the near future. These
kinds of dependencies will need to be reviewed.


• Edges to files that are created by and opened for
reading and writing in short intervals and across
multiple invocations by a single program may be
safely labeled with a low rank.


For example, applications such as Emacs create
backup files during editing. While the user may rely
upon such backups, Emacs does not require these
files to function correctly.


• Files that are created/written by an editor like vi are
not dependent upon vi. They are dependent upon
the human being using vi. This is a dependency
that we can capture because we record the (E)UID
of every process. If the file is created or written
via shell redirection, we can still capture the dependency
based upon the shell owner.


• Files created or modified by a script are dependent
upon the script and probably many of its ancestors. 
But the path must ultimately lead back to the pro- 
cess that generated the script, whether manual or 
automatic. 


• Any troubleshooting tools we build can integrate
the use of whitelist, blacklist, Bayesian, and other 
filters. These will give the user flexibility in their 
queries and will certainly encourage use of the tool 
for purposes other than troubleshooting. 
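The first observation in the list, conditioning an edge's rank on the parameters a program was started with, can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout (program, starting parameters, files read), the function name, and the reading of "rank" as the fraction of instances in a group that read an object are all hypothetical and are not taken from PASS.

```python
from collections import defaultdict


def conditional_ranks(instances):
    """Rank each ((program, params), input) edge by the fraction of
    instances in that (program, params) group that read the input.

    `instances` is an iterable of (program, params, inputs) records;
    this layout is an assumption made for the sketch.
    """
    group_sizes = defaultdict(int)
    read_counts = defaultdict(int)
    for program, params, inputs in instances:
        group = (program, tuple(sorted(params)))
        group_sizes[group] += 1
        for obj in set(inputs):  # count each object once per instance
            read_counts[(group, obj)] += 1
    return {
        (group, obj): count / group_sizes[group]
        for (group, obj), count in read_counts.items()
    }


# Hypothetical observations of one daemon started with and without '-c'.
observed = [
    ("daemon", ["-c"], ["/etc/C.conf"]),
    ("daemon", ["-c"], ["/etc/C.conf"]),
    ("daemon", [], ["/etc/A.conf"]),
]
ranks = conditional_ranks(observed)
```

Grouping by starting parameters keeps the edge to each config file at full rank within its own group; pooling all three instances would instead dilute both edges.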


Acting intelligently upon the given observations will 
reduce the size and density of the query space. Note
that none of our algorithms or heuristics modifies
the graph. Edge rankings will be applied only at query
time based upon specified rules and filters, and will be 
computed in a lazy fashion. We do not want to rank one 
million edges for a single query unless it is necessary. 
For example, if a filter limits the query space to files in 
a particular directory, we do not need to rank edges from 
or to files in other directories, nor unnamed pipes. 
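A minimal sketch of query-time, lazy ranking follows. It assumes edges are available as (program, path, base rank) tuples and uses made-up adjustment factors for the directory heuristics; the real weights, as noted above, would have to be determined by experiment, and none of these names come from PASS.

```python
# Illustrative directory lists and factors only (assumptions).
HIGH_DIRS = ("/etc",)
LOW_DIRS = ("/var/log", "/tmp")


def in_dir(path, directory):
    """True if `path` is `directory` itself or lies beneath it."""
    return path == directory or path.startswith(directory.rstrip("/") + "/")


def adjusted_rank(base_rank, path):
    """Apply the directory heuristics to a base rank in [0, 1]."""
    if any(in_dir(path, d) for d in LOW_DIRS):
        return base_rank * 0.5
    if any(in_dir(path, d) for d in HIGH_DIRS):
        return min(1.0, base_rank * 1.5)
    return base_rank


def ranked_edges(edges, keep):
    """Lazily yield ((program, path), rank) for edges passing the filter.

    Being a generator, it ranks an edge only when the caller asks for
    it, and it never modifies the stored edges.
    """
    for program, path, base_rank in edges:
        if keep(path):
            yield (program, path), adjusted_rank(base_rank, path)
```

A query restricted to /etc would pass `lambda p: in_dir(p, "/etc")` as the filter, so edges to files elsewhere are skipped without ever being ranked.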

As an example of where filtering may fail, suppose we 
determine that a program is behaving abnormally. It has 
file A as input, amongst others. A conventional rule of 
thumb may lead us to filter based upon time; the pro- 
gram was working until a certain point in time, so it is 
reasonable to ask which process most recently wrote to 
A around that time. But this may not help us because at 
the granularity of our provenance, the information that 
was most recently written to A might not be the infor- 
mation that is causing a malfunction. It is possible that 
some previous write is causing a malfunction. Perhaps 
the program did not run during the period between the 
previous write and the most recent write. Thus the ef- 
fect of the previous write to A did not manifest until the 
program was run again. 
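The failure mode can be made concrete with a toy timeline. The sketch below assumes writes are recorded as (time, process) pairs and that we know when the program last ran successfully; both function names are hypothetical. Widening the window to every write since the last good run is one possible response, offered here only as an illustration.

```python
def most_recent_writer(writes, failure_time):
    """The time-based rule of thumb: the last writer before the failure."""
    earlier = [w for w in writes if w[0] <= failure_time]
    return max(earlier)[1] if earlier else None


def writers_since(writes, last_good_run, failure_time):
    """Every writer between the last good run and the failure, oldest first."""
    return [proc for t, proc in sorted(writes)
            if last_good_run < t <= failure_time]


# Hypothetical history of file A: X writes at t=10, Y at t=20.
# The program last ran correctly at t=5 and failed at t=30.
writes_to_A = [(10, "X"), (20, "Y")]
```

Here the rule of thumb names only Y, although X's earlier write may be the one that broke the program; the wider window returns both candidates.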


6 Under-specified Queries 


Filters and rules will help, but they are not sufficient. 
Even if we assume that the graph contains only actual 
dependencies, we still need the ability to limit the scope 
of under-specified queries. Such a query has the potential 
to return a very large subgraph because it does not suffi- 
ciently constrain ancestral breadth and depth. For exam- 
ple, if we query on the full lineage of /var/log/dmesg, 
we are likely to see all ancestors going back to installation
of the operating system. Depending upon the context,
this may be unhelpful. The ability to specify queries
precisely assumes the existence of an excellent mental 
model by which to navigate the provenance graph. As 
the graph expands, “surgical” queries demand a familiar- 
ity that is unsustainable without aid. Thus, our tool needs 
to be able to guess at good places to stop in the lineage 
of a target object. 

Several researchers in our group are attempting to 
tackle this problem based upon ideas inspired by web 
search [23]. Provrank is an algorithm that judges the im- 
portance of objects based upon their frequency across all 
possible lineage queries. Objects with a high frequency 
appear in too many lineage queries. Thus, if some pro- 
cess appears in the query path of every descendant object 
of interest, it does not add any important information to a 
query result and represents a good cutoff point. Another 
metric — frequency dissimilarity — captures the relative 
frequency of an object. That is to say, it measures how 
often an object appears in query results that contain ob- 
jects of the same kind (based upon some criteria). Thus, 
the bash shell will have a lower frequency dissimilar- 
ity in queries that ask for the lineage of mkdir, than in 
queries that ask for the lineage of a random user docu- 
ment (i.e., the bash node would be a good cutoff point 
for queries about user documents). 
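To make the frequency idea concrete, the sketch below counts how often each object appears across a set of lineage queries. It is not Provrank itself; it assumes a toy graph stored as a mapping from each object to its direct ancestors, and the object names are invented for the example.

```python
def lineage(parents, obj):
    """All ancestors of `obj` in a child -> [parents] mapping."""
    seen, stack = set(), [obj]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def lineage_frequency(parents, targets):
    """Fraction of the (non-empty) `targets` whose lineage contains each object."""
    counts = {}
    for target in targets:
        for ancestor in lineage(parents, target):
            counts[ancestor] = counts.get(ancestor, 0) + 1
    return {node: count / len(targets) for node, count in counts.items()}


# Hypothetical provenance: two documents from an editor, one log from make.
parents = {
    "report.txt": ["editor"],
    "notes.txt": ["editor"],
    "build.log": ["make"],
    "editor": ["bash"],
    "make": ["bash"],
}
freq = lineage_frequency(parents, ["report.txt", "notes.txt", "build.log"])
```

In this toy graph bash appears in every lineage, so it carries no distinguishing information and would be a natural cutoff point.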

Further work is required in this area to help sysadmins 
semi-automatically constrain their queries. 


7 Building Tools 


With a few decent algorithms under our belt, a gaggle of 
heuristics, and a good working knowledge of operating 
systems, what capabilities do we want for our tools and 
their interfaces? 

In our resolver example, we guided the reader through 
a troubleshooting session that uses the provenance graph. 
Although based on a real use case, the example was di- 
rected and abbreviated for clarity. In a real session, a 
sysadmin would need a guide as well; something to im- 
prove their chances of diagnosing the problem. 

Ideally, our tool must be able to present a relatively 
small group of root-cause candidates. But we are also 
helping admins build mental models. We expect to intro- 
duce several interfaces that leverage web technologies as 
well as the familiar command-line interface for conven- 
tional programmatic control. 

Since PQL is the primary method of querying the 
provenance graph, we also plan to introduce a set of pre- 
defined query classes that will help users learn how to 
construct and refine more complex queries. A graphical 
tool is in the works that will enable the construction of 
queries via example as well. 

Finally, integration is paramount. The user must be 
able to build the toolchain with relative ease and connect
it to existing monitoring and troubleshooting frameworks.
For example, as problems are solved, relevant
snippets of the graph and associated queries can be en- 
tered into a trouble ticket system and reviewed in subse- 
quent incidents that exhibit similar symptoms. 


7.1 Visualization 


Provenance graphs can grow to enormous proportions, 
which tends to work against building robust mental mod- 
els. Visualization can dramatically improve the ability 
for users to absorb and understand complex structures. 
As such, it is one of the most important aids to prove- 
nance analysis. 

We have already built a tool called Orbiter [22] that 
can, among other capabilities, display provenance graphs 
with adjustable magnification, perform rudimentary filtering
(e.g., by degree, object type, or timestamp) and
querying of ancestors and descendants, and summarize 
subgraphs at customized levels of granularity. We plan to 
extend Orbiter’s capabilities with query subgraph high- 
lighting, regular expression filters, process grouping, an- 
notations, and programmable views. We will encourage 
system administrators to describe the most useful aspects 
of the tool, as well as their thoughts on whether and how 
to eliminate or improve its failings. 


8 Future Work


The current implementation of PASS examines prove- 
nance as expressed only via pipes, shared memory 
(mmap), process environments, and the filesystem. Un- 
fortunately, more sources of provenance (and potential 
dependencies) are expressed via other information vec- 
tors, e.g., signals, sockets, message queues, shared mem- 
ory, semaphores, and exit codes. As a result, prove- 
nance graphs generated by our implementation are not 
comprehensive. We believe that analysis of network 
I/O will prove to be a powerful technique. By track- 
ing socket pairs, we can identify dependencies that span 
physical machines. For example, a network-aware ap- 
proach would be able to identify dependencies between 
a web server and a DNS server. Expanding the collection 
and analysis phases in this way will require considerable 
effort. 

Another drawback in our current implementation is 
the inability to collect provenance from root volumes 
or to aggregate provenance from multiple disparate vol- 
umes. We are working to address these shortcomings 
by building a new collection platform [21] based on the
Xen hypervisor [5] that obtains provenance directly from
system calls inside of guest VMs. We expect this reorientation
to yield new benefits, which include supporting
a better case for adoption than a patched Linux kernel.

There are many other technologies that might be em- 
ployed to help build and answer domain-specific trou- 
bleshooting queries, including further analysis of graph 
structure, more advanced statistical techniques, and a 
community-based query database. We also plan to in- 
corporate ideas from machine-learning, not only to help 
conduct semi-automatic analyses of provenance graphs 
and provide better dependency rankings, but to augment 
graphs with information gleaned from interactions of 
system administrators with our tools [10, 24]. 


9 Conclusions 


In our introduction, we made the claim that complete 
and accurate mental models are necessary to most tasks 
performed by system administrators, including trou- 
bleshooting and maintenance. As such, any tool that aids 
in the timely development of accurate mental models will 
be of great benefit to sysadmins at both the junior and se- 
nior level. 

In this paper, we have explored the idea that analy- 
sis of provenance graphs can aid system administrators 
in troubleshooting problems that involve complex hidden 
dependencies. We are confident that if system adminis- 
trators are amenable to automatic provenance collection, 
then this idea will emerge as an effective utility in everyday
system administration.


11 Availability


A working prototype is not yet available. However, read-


1 Remake


Autotools[Fou09, Fou10a] is still very popular as a framework for configuring and building open-source software.
Since it is a collection of smaller tools, such as autoconf, automake, libtool, and m4, debugging code that it generates
can be difficult.


When I wrote my first POSIX shell debugger for bash, one of my initial goals was to be able to debug autotools 
configure scripts, and I was rather pleased when it worked. It required, however, writing a custom bash module to 
read the 20,000 lines of shell script into an array much faster than bash was able to. (This module has since been 
incorporated into bash as built-in function readarray.) It was only after completing this task that I realized a POSIX
shell debugger was just one part of the bigger problem of debugging autotools scripts. Here, I describe the next step in
that endeavor, adding debugging to GNU Make[Fou10b, Ber11]. We will see how to use remake and a POSIX shell
debugger (the one for bash) together. 


Makefiles have been around for quite a while, and over time, largely through the success of automake, they have gotten 
more complex. Make can be somewhat opaque, but after writing the debugger component of remake, I can usually 
solve make problems very quickly and easily. 


Many programming languages, such as POSIX shell, Perl, Python, Ruby, and Lisp, have interactive shells in which you
can type expressions or statements to see what happens when they run. Although GNU Make is every bit as dynamic as
these other languages, currently there is no such interactive shell. But the debugger briefly described here can serve as
a handy substitute.


The programming language Ruby has a really interesting make equivalent called rake. (If you are writing something 
from scratch, please consider using both Ruby and rake.) But systems administrators often find themselves using tools 
and code written by others, and much open-source software uses make, via automake. Make is so pervasive that the 
reference implementations of Ruby use make to build themselves. 


In keeping with my philosophy of trying to use the smallest hammer that will do the job, this paper shows some of
the smallest changes of my forked version of GNU Make. When used in conjunction with one of my POSIX shell
debuggers, you can dynamically debug commands issued by GNU Make into the POSIX shell.


1.1 remake --tasks


A useful feature of Ruby’s rake program is that there is an option to print a list of “tasks” that one can perform. Tasks 
include things such as building the software, installing it, and running the tests. 


In make terminology, tasks are a subset of “files” or “targets.” However in make, we have to distinguish those files 
which are just supposed to be there in the source code from those that somehow get created; and many of the files 
that get created represent intermediate steps along the way to producing something larger. I find it good practice to 
borrow ideas from related tools, and I have added the --tasks option from Ruby’s rake. This handles files in this
way: if a target has a command to build it, then it is probably “interesting”; conversely, if there are no commands to 
build a target, that is, it is only listed as a dependency, then it probably is not interesting—it is there only to support 
other targets, and when it changes, it triggers other targets to be remade. Also, if a rule is a default rule of make, then 
it is probably not interesting. This would include things like the pattern rules for compiling a C program or extracting
something from an archive or source-control system. The same notion of “interesting” is used in debugger stepping. 


Here is remake --tasks for a typical Makefile system, using the Makefile that comes with the GNU Make
distribution. 


$ remake --tasks
oC ao 

sO 7 
.dep_segment 
CTAGS 

ChangeLog 


NMakefile
README
dist
dist-all
dist-bzip2


upload-alpha
upload-ftp


When I first looked at the output and saw README in this list of targets that have commands associated with them, I 
thought there must be a mistake, because README is usually a distribution file. So I broke out the debugger to check 
what was going on. The answer will become clear below, when I describe how to investigate targets with the debugger. 


Another piece of interesting information we learn from this output is that there is a way to make the ChangeLog file, 
presumably from version control, and a way to make just the bzip2 tarball, or upload the distribution to the alpha and 
FTP sites. 


Additionally, targets can have a description added for them so that they appear when the --tasks option is given. A
description must consist of only one line and begins with #:. Here is a Makefile tagged this way:


#: Build everything
all:
	perl Build --makefile_env_macros 1

#: Create distribution tarball
dist:
	perl Build --makefile_env_macros 1 dist

#: Build and install package
install:
	perl Build --makefile_env_macros 1 install

#: Create or update MANIFEST file
manifest:
	perl Build --makefile_env_macros 1 manifest

#: Create or update manual pages
manpages:
	perl Build --makefile_env_macros 1 manpages


When run with the --tasks option we get:


all # Build everything
dist # Create distribution tarball
install # Build and install package
manifest # Create or update MANIFEST file
manpages # Create or update manual pages
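The pairing of a #: line with the target that follows it can be illustrated with a small scanner. This is a sketch only, not remake's implementation (remake works from GNU Make's own parsed data): the regular expression covers just simple explicit targets, and the requirement that the description sit on the line directly above the target is an assumption made here.

```python
import re

# Matches a simple explicit target such as "dist:" but not an
# assignment such as "VAR := value". Illustrative only.
TARGET_RE = re.compile(r"^([A-Za-z0-9_.-]+)\s*:(?!=)")


def described_tasks(makefile_text):
    """Return (target, description) pairs for targets preceded by '#:'."""
    tasks = []
    description = None
    for line in makefile_text.splitlines():
        if line.startswith("#:"):
            description = line[2:].strip()
            continue
        match = TARGET_RE.match(line)
        if match and description is not None:
            tasks.append((match.group(1), description))
        description = None  # a description applies only to the next line
    return tasks
```

Applied to the tagged Makefile above, this yields one pair per target, in file order.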
1.2 remake --trace example


Here is a real-world example of tracing and then debugging. 


I made a change to the GNU Make source code and I wanted to make a new distribution. I ran the usual command that 
does this: 


$ make dist
NEWS not updated; not releasing
make: *** [distdir] Error 1


It is not clear how NEWS needs to be updated, but more information might be available by consulting the rules for the 
distdir target. 


Now let us run remake: 


$ remake dist
NEWS not updated; not releasing
Makefile:887: *** [distdir] Error 1


#0 distdir at /tmp/remake/Makefile:887
#1 dist at /tmp/remake/Makefile:1004
Command-line invocation:
    "remake dist"


This shows additional information: the line number inside the Makefile for target distdir (887), the target that got
us to this one, and the command-line invocation. 


With standard make, we could use the name of the target distdir and search inside Makefile for that, but frequently
it might not be in the top-level Makefile. Instead, it might be in some recursive invocation or in a file included
from the top-level Makefile. The traceback information and file names reduce the detective work needed.


Now let us run remake with tracing turned on: 


$ remake --trace dist
GNU Make 3.82+dbg-0.7.dev


Reading makefiles...
Updating goal targets....
 /tmp/remake/Makefile:1004 File ‘dist’ does not exist.
   /tmp/remake/Makefile:887 File ‘distdir’ does not exist.
   /tmp/remake/Makefile:887 Must remake target ‘distdir’.

to be continued...


The indentation in the lines containing file name and line numbers gives target level nesting: target distdir was 
asked to be remade because it is a dependency of target dist. With this option, we show the dependency nesting as 
we build and traverse the tree. Without it, we show these only at the point of error, if there is one. 


Invoking recipe from Makefile:888 to update target ‘distdir’.
##>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
case `sed 15q ./NEWS` in \
*"3.82+dbg-0.7.dev"*) : ;; \
*) \
echo "NEWS not updated; not releasing" 1>&2; \
exit 1;; \
esac
##<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
NEWS not updated; not releasing
Makefile:887: *** [distdir] Error 1


#0 distdir at /tmp/remake/Makefile:887
#1 dist at /tmp/remake/Makefile:1004
Command-line invocation:
    "remake -x dist"

So finally we get the commands that were run to get that message, indicating which check was performed. 


Another difference between remake and standard GNU Make is that remake prints lines of the form:


#H# >>> >>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
a Ae 


These lines serve to separate shell commands about to be run from the output that they produce. 


In brief, the changes to GNU Make to improve tracing and error reporting are: 


• The Makefile file name and the line inside this file are reported when referring to a target

• On error:

  - a stack of relevant targets is shown, again with their locations
  - the command invocation used to run make is shown
  - an option allows for entering the debugger on error

• Shell input that is about to be run is separated from the output in running that shell code

• a --tasks option prints a list of interesting targets and any associated description line for each


1.3 remake --debugger example


The tracing described in the previous section will be enough for some purposes. But we can make the computer do
more work to show us what is going on by using the built-in debugger.


Why does README appear when we run remake --tasks? We can ask the debugger to describe the target README:


$ remake --debugger
GNU Make 3.82+dbg-0.7.dev
Reading makefiles...
Updating makefiles....
-> (/tmp/remake/Makefile:477)
Makefile: Makefile.in config.status
remake<0> target README
README: README.template Makefile
# Implicit rule search has not been done.
# Implicit/static pattern stem: ‘README’
# Modification time never checked.
# File has not been updated.
# Commands not yet started.
# automatic
# @ := README
# automatic
# < := README.template
# automatic
# ...
# commands to execute (from ‘Makefile’, line 1329):
rm -f $@
sed -e 's@%VERSION%@$(VERSION)@g' \
-e 's@%PACKAGE%@$(PACKAGE)@g' \
$< > $@
chmod a-w $@
remake<1>


The file README is created from README.template. In the commands section, there are a number of unexpanded
variables such as $@ and $<. Earlier though, the values of the automatic variables @ and < are shown; here they are
README and README.template respectively. If, however, we want remake to do the expansion when showing the
commands, there is an option to the target command for that:


remake<1> target README expand
README:
# commands to execute (from ‘Makefile’, line 1329):
rm -f README
sed -e 's@%VERSION%@3.82+dbg-0.7.dev@g' \
-e 's@%PACKAGE%@remake@g' \
README.template > README
chmod a-w README
remake<1>


Although it is not immediately apparent, some expansion was done in showing the target and dependencies. Line
1328 in file Makefile looks like this:


$(TEMPLATES) : % : %.template Makefile


The debugger command expand can be used to get the expanded value of the variable TEMPLATES:


remake<2> expand TEMPLATES 
Makefile:1319 (origin: makefile) TEMPLATES := README README.DOS
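How a static pattern rule such as the one on line 1328 turns a target list into concrete rules can be sketched as follows. This simplified model is an illustration, not GNU Make's matching code: it assumes exactly one % in the target pattern, substitutes the stem into each prerequisite pattern, and ignores directory handling and other special cases. The function name is hypothetical.

```python
def expand_static_pattern(targets, target_pattern, prereq_patterns):
    """Map each matching target to its prerequisites.

    The part of the target matched by '%' (the stem) replaces '%' in
    each prerequisite pattern; prerequisites without '%' are kept as-is.
    """
    prefix, suffix = target_pattern.split("%", 1)
    rules = {}
    for target in targets:
        long_enough = len(target) > len(prefix) + len(suffix)
        if not (long_enough and target.startswith(prefix)
                and target.endswith(suffix)):
            continue  # target does not match the pattern
        stem = target[len(prefix):len(target) - len(suffix)]
        rules[target] = [p.replace("%", stem, 1) for p in prereq_patterns]
    return rules


rules = expand_static_pattern(
    ["README", "README.DOS"], "%", ["%.template", "Makefile"]
)
```

With the TEMPLATES value shown above, README maps to README.template and Makefile, which agrees with the dependencies the debugger printed for the README target.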


Now we return to tracking down what was happening when we tried to run make dist. Again we go into the 
debugger: 


$ remake --debugger dist
GNU Make 3.82+dbg-0.7.dev


Reading makefiles...
Updating makefiles....
-> (/tmp/remake/Makefile:477)
Makefile: Makefile.in config.status


It appears that the first thing that is done is to check whether the Makefile itself is up to date. As before, we could list 
information from the target that we crashed on, distdir. However instead let us run until the target: 


remake<0> continue distdir run
Breakpoint 1 on target distdir: file Makefile, line 887.
Updating goal targets....
 /tmp/remake/Makefile:1004 File ‘dist’ does not exist.
   /tmp/remake/Makefile:887 File ‘distdir’ does not exist.
.. (/tmp/remake/Makefile:887)
distdir


There are three interesting points in time when updating a target: 


1. before checking dependencies of the target
2. after checking but before running commands to update the target
3. after running commands when the target update is finished


Adding run to the end of continue distdir causes us to stop after dependency checking. 


The debugger first stopped before dependency checking, as shown by an icon, the two-character arrow ->, so it lists
the dependencies for the target. For the Makefile target, they were Makefile.in and config.status. After
continuing, it next stops after dependency checking, so dependencies of the target are not automatically shown unless
explicitly requested with target, just as for the commands.


A common problem in designing this kind of tool is trying to figure out how to cut down the amount of information 
shown. We usually do not want a list of all dependencies for distdir here since that would include a list of all of 
the files in the distribution. With the --tasks option above, files without associated commands are dropped from
the listing. 


Another indication that the debugger stopped after dependency checking is that the two-character icon is .. rather
than ->. I try to use analogous gdb commands when possible. Here, the gdb-like command info program makes
the stopping place more explicit:


remake<3> info program
Starting directory ‘/tmp/remake’
Program invocation:
remake -X distdir
Recursion level: 0
Line 887 of "/tmp/remake/Makefile"
Program is stopped after rule-prerequisite checking.

At this point we can list the commands that are to be run next using the target command, which shows information 
regarding a target. We will use variables that have been set up by GNU Make when giving a target name. As we saw 
when listing variables for README, @ is an automatically set variable containing the name of the current target. Since 
we have run to target distdir, @ is set to that.


remake<1> target @ commands
distdir:
# commands to execute (from ‘Makefile’, line 888):
@case `sed 15q $(srcdir)/NEWS` in \
*"$(VERSION)"*) : ;; \
*) \
echo "NEWS not updated; not releasing" 1>&2; \
exit 1;; \
esac
@list='$(MANS)'; if test -n "$$list"; then \
list=`for p in $$list; do \
if test -f $$p; then d=; else d="$(srcdir)/"; fi; \
if test -f "$$d$$p"; then echo "$$d$$p"; else :; fi; done`; \
about 90 other lines.


Makefile commands can be confusing because there are two sources for variables: GNU Make variables and POSIX- 
shell variables. Here we see things like $ (VERSION) which is a GNU make variable and $$p which is the POSIX- 
shell variable $p. An extra $ needs to be added in the Makefile. We can ask the debugger to expand all of the Makefile 
variables, but instead, let us write this code out to a file using the write command: 


remake<2> write 
File "/tmp/distdir.sh" written. 
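The two layers of variables can be illustrated with a toy expander that turns a recipe line into the text the shell would receive. It is a simplified sketch under stated assumptions, not what remake's write command does internally: it handles only $(NAME) references and the $$ escape, expands undefined names to the empty string, and ignores automatic variables such as $@ as well as make functions. The function name is hypothetical.

```python
import re

# Either the "$$" escape or a simple "$(NAME)" reference.
_MAKE_REF = re.compile(r"\$\$|\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


def expand_recipe(line, variables):
    """Expand $(NAME) from `variables` and turn '$$' into a shell '$'."""
    def replace(match):
        if match.group(0) == "$$":
            return "$"
        return variables.get(match.group(1), "")
    return _MAKE_REF.sub(replace, line)


shell_line = expand_recipe(
    'if test -f $$p; then d=; else d="$(srcdir)/"; fi',
    {"srcdir": "."},
)
```

After expansion the make variable has been replaced by its value, while $$p has become the shell variable $p that the shell itself will expand later.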


We can use the bash debugger bashdb to debug the rest. 


remake<3> quit
remake: That’s all, folks...
$ bashdb /tmp/distdir.sh
bash debugger, bashdb, release 4.2-0.7

(/tmp/distdir.sh:4):
4: case `sed 15q ./NEWS` in \
bashdb<3> step
(/tmp/distdir.sh:4):
4: case `sed 15q ./NEWS` in \
sed 15q ./NEWS


If we do not know what sed 15q ./NEWS does, rather than look this up in a manual, we can let the debugger
show us. The parentheses in the bashdb prompt mean that we are inside a subshell, the backtick part of `sed 15q
./NEWS`.

A useful command I added not too long ago to the debuggers is eval without any arguments. Here, it takes the line 
that is about to be run and runs it. 


bashdb<(4)> eval
eval: sed 15q ./NEWS 
Version 3.82+dbg-0.6 
GNU make NEWS 
History of user-visible changes. 
23 July 2010 


One more step and we go to where we do not want to be: 


bashdb<(5)> step
(/tmp/distdir.sh:7):
7: echo "NEWS not updated; not releasing" 1>&2; \
bashdb<6> list
2: #/tmp/remake/Makefile:887
3: #cd /tmp/remake
4: case `sed 15q ./NEWS` in \
5: *"3.82+dbg-0.7.dev"*) : ;; \
6: *) \
7: => echo "NEWS not updated; not releasing" 1>&2; \
8: exit 1;; \
9: esac
10: @list='make.1'; if test -n "$list"; then \
11: list=`for p in $list; do \
bashdb<7>


What is wrong is that we were looking for 3.82+dbg-0.7.dev inside the first 15 lines of the file NEWS and we did
not find that.
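For readers less familiar with the shell idiom, the check that failed can be restated as a short function. This is an illustrative equivalent of the case/sed test, not code from the Makefile; the function name is hypothetical, and the sample text is the NEWS header printed by eval above.

```python
def news_mentions_version(news_text, version, lines=15):
    """True if `version` occurs within the first `lines` lines of NEWS.

    Mirrors the shell test: sed 15q keeps the first 15 lines and the
    case pattern *"VERSION"* is a substring match.
    """
    head = news_text.splitlines()[:lines]
    return any(version in line for line in head)


news = (
    "Version 3.82+dbg-0.6\n"
    "GNU make NEWS\n"
    "History of user-visible changes.\n"
    "23 July 2010\n"
)
up_to_date = news_mentions_version(news, "3.82+dbg-0.7.dev")
```

The header still names 3.82+dbg-0.6, so the check for 3.82+dbg-0.7.dev fails, which is exactly the condition that made the distdir recipe exit with an error.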


The above example barely scratches the surface of what is available in both my GNU Make debugger and my
POSIX shell debuggers. There is extensive help inside the debuggers and in the online manuals
http://bashdb.sourceforge.net/remake/remake.html/index.html and
http://bashdb.sourceforge.net/bashdb.html.


1.4 History and Acknowledgments


The idea for a GNU Make debugger came about after I had completed a debugger for bash[Ber09] and realized that 
there was much more to debugging distribution building in autoconf and automake scripts than just the configure 
script. So I first floated the idea in freshmeat forum[McC03]. A year later, in response to a challenge[Smi04], I wrote
the first code without much trouble. 


GNU Make already had a wealth of debugging information stored, so all that was needed was to keep track of a 
dependency stack and add calls to a REPL (read, eval, print loop) at appropriate times. Delving into the code to figure 
out the right times and places was the bulk of the hard work. 


One suggestion is to display a tree or subtree of targets, possibly as a graph. Unfortunately, GNU Make does not save 
a tree of targets. Instead, it grows the branch it needs as it traverses targets and removes it afterwards. In order to 
provide debugging, I had to extend the code to save information from the current target back to the goal target. 
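The idea of remembering the path from the current target back to the goal can be sketched with an explicit stack. This toy model is an assumption-laden illustration and does not reflect GNU Make's or remake's internals: rules are a dictionary mapping a target to a (location, prerequisites, recipe) triple, a recipe is a callable that returns True on success, and all names are hypothetical.

```python
class TargetFailed(Exception):
    """Raised when a recipe fails; the message is the target stack."""


def format_stack(stack):
    """Innermost target first, in the style of the traceback shown earlier."""
    return "\n".join(
        f"#{depth} {target} at {location}"
        for depth, (target, location) in enumerate(reversed(stack))
    )


def update(target, rules, stack):
    """Depth-first update that records the path back to the goal."""
    location, prerequisites, recipe_ok = rules[target]
    stack.append((target, location))
    for prerequisite in prerequisites:
        if prerequisite in rules:
            update(prerequisite, rules, stack)
    if not recipe_ok():
        raise TargetFailed(format_stack(stack))
    stack.pop()  # the branch is discarded once the target is done


def traceback_for(goal, rules):
    """Return the formatted stack if updating `goal` fails, else None."""
    try:
        update(goal, rules, [])
    except TargetFailed as error:
        return str(error)
    return None
```

Because a branch is popped as soon as its target finishes, only the path from the failing target back to the goal survives, which is all a traceback needs.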


So, some target actions can affect whether subsequent targets are up-to-date or not. To make things more complex, 
targets can be patterns that dynamically match the files created at run-time, and short of “building” the code, one can 
only give an approximation of existing dependencies. 


I would like to thank Calyxa D. Tokay for her constant encouragement, and Stuart Frankel for turning my jumble 
of ideas into a slightly more coherent and well-organized paper. The anonymous reviewers’ comments were very
helpful. 


1.5 Availability 


The home page for this project is http://bashdb.sourceforge.net/remake/. Download links for source
code can be found there. 


Yaroslav Halchenko has been providing Debian packages. The git source repository is at:
https://github.com/rocky/remake.